Unable to extract complete table in Document understanding

rds0511 · October 29, 2023, 2:56am

Hi Team,

I am able to extract individual tables from a PDF, but I am unable to extract the complete table. In the attached example, I can get the table that contains the daily data (day 1, day 2, etc.).

But I want to extract the complete table, which includes the month, day, and program details, and write all of that data to an Excel file. That is not working. Please assist.

DU.pdf (22.4 KB)

Thanks,
Rds

rds0511 · October 29, 2023, 3:06am

@Anil_G @Lahiru.Fernando @nisargkadam23 Can you assist? This tabular data with subheaders is creating the issue.

Thanks,
Rds

Dilli_Reddy · October 29, 2023, 3:11am

@rds0511

- Start by using the “Read PDF Text” activity in UiPath to extract the text content of the entire PDF file into a string variable.
- Once you have the PDF text in a variable, you can use string manipulation and regular expressions to identify and extract the table structure from the text. Look for patterns that can help you separate the month, day, and program data. Regular expressions can be especially useful for this purpose.
- After extracting the table structure, you should organize the data into a structured format, such as a data table or a list of lists, with columns for month, day, and program data.
- Use the “Write Range” activity to write the structured data to an Excel file. You can specify the Excel file and the sheet name to write the data into.

OR

- Use “Read PDF Text” to extract the text content from the PDF and store it in a variable.
- Use regular expressions to extract the table data.
- Loop through the extracted data and organize it into a structured format, such as a DataTable.
- Use the “Build Data Table” activity to create a DataTable with appropriate columns (Month, Day, Program).
- Use the “Add Data Row” activity within a loop to add each row of data to the DataTable.
- Use the “Write Range” activity to write the DataTable to an Excel file.
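
The parsing step in the lists above can be sketched outside UiPath as follows. This is only a rough illustration in Python, not a UiPath workflow. The sample text layout (a month line, a "Program Day1 Day2 ..." header line, then one line per program) is an assumption, since the real layout of DU.pdf is not shown in this thread, and the names `parse_table` and `to_csv` are hypothetical helpers.

```python
import csv
import io
import re

# Hypothetical text standing in for the output of "Read PDF Text".
# The real layout of DU.pdf is unknown; this format is an assumption.
SAMPLE_TEXT = """August
Program Day1 Day2 Day3
Alpha 10 20 30
Beta 5 15 25
September
Program Day1 Day2 Day3
Alpha 7 8 9
"""

MONTH_RE = re.compile(
    r"^(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)$"
)
HEADER_RE = re.compile(r"^Program((?:\s+Day\s*\d+)+)$")
ROW_RE = re.compile(r"^(\S+)((?:\s+\d+)+)$")


def parse_table(text):
    """Return a list of (month, program, day, value) tuples."""
    rows = []
    month = None
    days = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if MONTH_RE.match(line):
            month = line  # a month subheader starts a new block
            continue
        header = HEADER_RE.match(line)
        if header:
            days = re.findall(r"Day\s*\d+", header.group(1))
            continue
        row = ROW_RE.match(line)
        if row and month and days:
            values = row.group(2).split()
            for day, value in zip(days, values):
                rows.append((month, row.group(1), day, int(value)))
    return rows


def to_csv(rows):
    """Serialize the rows as CSV text (a stand-in for "Write Range")."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Month", "Program", "Day", "Value"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse_table(SAMPLE_TEXT)
csv_text = to_csv(rows)
```

Each output row carries its month and program, so the relation between a program and its daily values survives the flattening. For a real PDF the three patterns would need to be adapted to the actual text.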

rds0511 · October 29, 2023, 3:14am

Thanks @Dilli_Reddy for your reply. I have only given a sample in the attachment; the actual PDF is complex.

I am using Document Understanding here, so I want to extract the complete table through DU. That is the help I need.

I am able to extract data from an individual table, but I am not able to extract the complete table data through DU in one go. For example, I can extract the day data by highlighting that part. But when I try to include August, it does not extract the day data.

Anil_G · October 29, 2023, 4:59am

@rds0511

Try indicating August as a column instead of a row and then try to extract it…that way you get August for each row…

As there are multiple headers it might be difficult…so you can try this and then write August as the header on top in Excel

Cheers

rds0511 · October 29, 2023, 5:09am

Thanks @Anil_G for the reply. I was able to extract the August value. But the data for the complete month of August lies in day-wise columns (day 1, day 2, day 3, … day 31). That is where I am facing issues.

Also, just like August, there will be September and October headers with their daily data.

If I extract the August column, it does not give me the day-wise data. That's the issue.

Anil_G · October 29, 2023, 5:10am

@rds0511

I understand that…I believe you indicated August as part of a column in the table…what I am saying is to indicate it as a value instead of a column…and indicate the remaining table as a table and extract it…that way August will be appended to each row of the table you extract…

Cheers

rds0511 · October 29, 2023, 6:54am

So you want me to extract the other months the same way, with the month header as a value and then highlighting their daily data. I'm on it.

Anil_G · October 29, 2023, 7:00am

@rds0511

Exactly…

Like just think of the month as a separate value and not part of the table…generally, if you extract something as a separate field alongside a table…it is added to each row…which can then be converted
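
As a rough illustration of that idea (plain Python, not a UiPath activity), the month captured as a separate value can be attached to every row of the extracted table. The sample values and the helper name `add_month_column` are hypothetical.

```python
def add_month_column(month, table_rows):
    """Attach the separately extracted month value to every table row."""
    return [{"Month": month, **row} for row in table_rows]


# Hypothetical extraction results: the month captured as a regular field,
# the daily data captured as a table (a list of row dictionaries).
month_field = "August"
daily_table = [
    {"Day1": 10, "Day2": 20},
    {"Day1": 5, "Day2": 15},
]

rows_with_month = add_month_column(month_field, daily_table)
```

In a workflow, the same effect would come from adding a Month column to the extracted DataTable and filling it with the field value.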

Cheers

rds0511 · October 29, 2023, 4:51pm

I extracted the month header. But there is a relation between the program name and the daily data, and I have to do further calculations on the basis of that data.

If you see my attachment, I first need to count how many programs there are, and then for each program I need to use the daily data to perform a calculation.

Is there any way to extract all the tabular data in one go, in the same format as the PDF?

Regards

supermanPunch · October 29, 2023, 7:47pm

Hi @rds0511 ,

I believe this is not quite possible, or even if it is possible now, it may not be recommended. Because you are also providing relations through row labelling, telling the model that something is a row, the model could get confused about which data belongs to the columns day1, day2, day3, etc.

It is best to keep the meaning of the values in sync with their labelling and extract them as columns. We can then post-process the result into the required format after extraction.

So, as @Anil_G mentioned, you would need to capture the month values of the table as a separate column (assuming there are multiple tables on a single page), and likewise keep the Program column separate.

If there is always only a single table on a page, then you can label the month value as a regular field.
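
The post-processing step could look roughly like the Python sketch below. It assumes the extraction has already produced flat rows with Month and Program columns plus one column per day; the sample rows and the helper `group_by_program` are hypothetical. The thread never states the actual calculation, so the sum here is only a placeholder.

```python
from collections import defaultdict


def group_by_program(rows):
    """Regroup flat rows as {program: {month: {day: value}}}."""
    grouped = defaultdict(dict)
    for row in rows:
        # Every column other than Month and Program is treated as a day column.
        days = {k: v for k, v in row.items() if k not in ("Month", "Program")}
        grouped[row["Program"]][row["Month"]] = days
    return dict(grouped)


# Hypothetical flat rows after extraction (any number of months and programs).
extracted_rows = [
    {"Month": "August", "Program": "Alpha", "Day1": 10, "Day2": 20},
    {"Month": "August", "Program": "Beta", "Day1": 5, "Day2": 15},
    {"Month": "September", "Program": "Alpha", "Day1": 7, "Day2": 8},
]

by_program = group_by_program(extracted_rows)
program_count = len(by_program)

# Placeholder calculation: total of all daily values per program.
totals = {
    program: sum(sum(days.values()) for days in months.values())
    for program, months in by_program.items()
}
```

Nothing here depends on a fixed number of months or programs, so the same grouping works when both vary from file to file.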

rds0511 · October 29, 2023, 7:53pm

Thanks for your reply. There is only one table, but it has multiple subheader columns, and each row has corresponding day-wise data. Even if I extract the daily data separately, how will I match it to the corresponding program row?

All of this is dynamic: there can be any number of months and any number of program rows.

Can you suggest the best solution approach?

Anil_G · October 30, 2023, 2:28am

@rds0511

Can you show how the data is extracted when you extract the month as a separate value instead of indicating it as a table value?

And what is the final expected result for you…instead of having it as a column header?

Because even if you get it as a header, I believe you still need to do manipulations to identify it and check the value when needed

We can do manipulations on the extracted table to get the required output for you

Cheers

rds0511 · October 30, 2023, 3:33am

Yes, I will attach that. I created it in Enterprise edition.

But the problem for me is that it gets individual parts of the table. For the program header, say, I gave " program" as the header and got all the outputs. And I extracted the daily data table.

But setting up the correlation between these tables and doing the calculations does not seem feasible to me.

program 1 ----------- validate all daily data ----------do calculation
program 2------------ validate all daily data----------do calculation

I could be wrong about this. Can you share your inputs? All I was doing was extracting parts of a big table in chunks. The actual PDF file has multiple headers and corresponding subheaders.

Do we have any UiPath machine learning skill to extract the exact tabular format from a PDF?

Regards

Anil_G · October 30, 2023, 3:39am

@rds0511

Except for the month part, try to extract the remaining full table as one…

That should be possible by indicating the second row, after the month row…

If that is not feasible or the data is not correct…then I believe the program data is present only once in each row and not across multiple rows…so we can extract both separately and merge the datatables together using the row index
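
A minimal sketch of merging by row index, written in Python rather than as UiPath activities. It assumes both extractions return rows in the same order and with the same count; the sample tables and the helper `merge_by_row_index` are hypothetical.

```python
def merge_by_row_index(program_rows, daily_rows):
    """Combine row i of the program table with row i of the daily table."""
    if len(program_rows) != len(daily_rows):
        raise ValueError("Tables have different row counts; cannot merge by index")
    return [{**program, **daily} for program, daily in zip(program_rows, daily_rows)]


# Hypothetical results of two separate extractions of the same table.
program_table = [{"Program": "Alpha"}, {"Program": "Beta"}]
daily_table = [{"Day1": 10, "Day2": 20}, {"Day1": 5, "Day2": 15}]

merged = merge_by_row_index(program_table, daily_table)
```

The row-count check matters here: if one extraction misses a row, an index-based merge would silently pair programs with the wrong daily data.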

For extracting complex tables we don't have any inbuilt algorithm in ML

Hope this helps

Cheers

rds0511 · October 30, 2023, 3:50am

Thanks…let me share what I created. But the number of program rows will vary, and likewise we can have data for multiple months.

Anil_G · October 30, 2023, 4:16pm

@rds0511

Please show the output of what you trained and what is needed so that we can do the manipulation accordingly…

Cheers
