Unable to extract complete table in Document Understanding

Hi Team,

I am able to extract an individual table from a PDF, but I am unable to extract the complete table. In the attached example, I am able to get the table containing the day data (day1, day 2, etc.).

But I want to extract the complete table, which includes the month, day and program details, and get all the data into an Excel file. It's not working. Please assist.

DU.pdf (22.4 KB)

Thanks,
Rds

@Anil_G @Lahiru.Fernando @nisargkadam23 Can you assist? This subheader tabular data is creating an issue.

Thanks,
Rds

@rds0511

- Start by using the “Read PDF Text” activity in UiPath to extract the text content of the entire PDF file into a string variable.
- Once you have the PDF text in a variable, you can use string manipulation and regular expressions to identify and extract the table structure from the text. Look for patterns that can help you separate the month, day, and program data. Regular expressions can be especially useful for this purpose.
- After extracting the table structure, you should organize the data into a structured format, such as a data table or a list of lists, with columns for month, day, and program data.
- Use the “Write Range” activity to write the structured data to an Excel file. You can specify the Excel file and the sheet name to write the data into.

OR

  1. Use “Read PDF Text” to extract the text content from the PDF and store it in a variable.

  2. Use regular expressions to extract the table data.

  3. Loop through the extracted data and organize it into a structured format, such as a DataTable.

  4. Use the “Build Data Table” activity to create a DataTable with appropriate columns (Month, Day, Program).

  5. Use the “Add Data Row” activity within a loop to add each row of data to the DataTable.

  6. Use the “Write Range” activity to write the DataTable to an Excel file.
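To make the regex step concrete, here is a minimal sketch in Python rather than in UiPath activities. The text layout in `SAMPLE_TEXT`, the program names, and the helper names `parse_table` and `to_csv` are all assumptions for illustration, since the layout of the attached PDF is not shown in the thread. It flattens each month block into (Month, Day, Program, Value) rows and writes them as CSV, which Excel can open.

```python
import csv
import io
import re

# Hypothetical text, standing in for the output of "Read PDF Text".
SAMPLE_TEXT = """\
August
Program Day1 Day2 Day3
Alpha 5 3 8
Beta 2 7 1
September
Program Day1 Day2
Alpha 4 6
"""

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
MONTH_RE = re.compile(rf"^(?:{MONTHS})$", re.IGNORECASE)
# A data row: a program name followed by whitespace-separated integers.
# Assumption: program names do not end in a standalone number.
ROW_RE = re.compile(r"^(?P<program>.+?)\s+(?P<values>\d+(?:\s+\d+)*)$")


def parse_table(text):
    """Return a list of (month, day, program, value) tuples."""
    rows = []
    month = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if MONTH_RE.match(line):
            month = line  # a month header starts a new block
            continue
        match = ROW_RE.match(line)
        if match and month is not None:
            values = match.group("values").split()
            for day, value in enumerate(values, start=1):
                rows.append((month, day, match.group("program"), int(value)))
    return rows


def to_csv(rows):
    """Serialise the rows as CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Month", "Day", "Program", "Value"])
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(parse_table(SAMPLE_TEXT)))
```

The sub-header line ("Program Day1 Day2 Day3") is skipped because it does not end in whitespace-separated numbers. A real PDF will likely need different patterns.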

Thanks @Dilli_Reddy for your reply. Actually, I have only given a sample in the attachment. The real PDF is complex.

I am using Document Understanding there, so I want to extract the complete table through DU. That is the help I need here.

I am able to extract data from an individual table, but I am not able to extract the complete table data through DU in one go. For example, I can extract the day data by highlighting that part. But when I try to include August, it does not extract the day data.

@rds0511

Try indicating August as a column instead of a row and then try to extract it…so that for each row you get August…

As there are multiple headers it might be difficult…so you can try this, and then in Excel you can write August as a header on top

Cheers

Thanks @Anil_G for the reply. I was able to extract the August value. But the data for the complete month of August lies in day-wise columns: day1, day2, day3… up to day 31. That is where I am facing issues.

Also, just like August, there will be September and October headers with their daily data.

If I extract the August column, it does not give me the day-wise data. That's the issue.

@rds0511

I understand that…I believe you indicated August as part of a column in the table…what I am saying is to indicate it as a value instead of a column…and indicate the remaining table as a table and extract…that way August will be appended to each row of the table you extract…
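As a rough illustration of the idea, assuming the month comes back as a single field value and the table comes back as a list of row dictionaries (both shapes, and the helper name `attach_month`, are hypothetical and not the actual Document Understanding output), attaching the month to every row could look like this in Python:

```python
def attach_month(month, table_rows):
    """Return new rows with the month value added as a leading 'Month' key."""
    return [{"Month": month, **row} for row in table_rows]


# Hypothetical extraction results: one separate field plus one table.
month_field = "August"
table = [
    {"Program": "Alpha", "Day1": "5", "Day2": "3"},
    {"Program": "Beta", "Day1": "2", "Day2": "7"},
]

rows_with_month = attach_month(month_field, table)
```

Each row then carries its own month, so rows from several months can later be stacked into one table without losing track of where they came from.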

Cheers

In the same way, you want me to extract the other months… with the month header as a value and then highlighting their daily data… I'm on it.

@rds0511

Exactly…

Just think of the month as a separate value and not part of the table…generally, if you extract something separately alongside a table…then it is added separately to each row…which can then be converted

Cheers

I extracted the month header. But there is a relation between the program name and the daily data, and I have to do further calculations on the basis of that data.

If you see my attachment, first I need to count how many programs there are, and then for each program I need to use the daily data to perform a calculation.

Is there any way to extract all the tabular data in one go, in the same format as in the PDF?

Regards

Hi @rds0511 ,

I believe this is not quite possible, or even if it is possible now, it may not be recommended. By labelling the relation as a row you are telling the model that it is a row, and it would then also get confused about which data belongs to the columns day1, day2, day3, etc.

It is best to keep the meaning of the values in sync with their labelling and extract them as columns. We can then post-process the result into the required format after extraction.

So, as @Anil_G mentioned, you would need to capture the Month value of the table as a separate column (considering there may be more than one table on a single page), and keep the Program column separate as well.

If there is always only a single table on a page, then you can label the Month value as a regular field.
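For the post-processing step, a small Python sketch may help show that the program-to-daily-data relation survives once Month and Program are ordinary columns. The row shape, the `Day<N>` column naming, and the use of a sum as the "calculation" are assumptions, since the thread does not say what the real calculation is.

```python
from collections import defaultdict


def daily_values_by_program(rows):
    """Map each program name to its list of (month, day, value) entries."""
    grouped = defaultdict(list)
    for row in rows:
        for key, raw in row.items():
            # Assumption: day columns are named "Day1", "Day2", ...
            if key.lower().startswith("day"):
                entry = (row["Month"], int(key[3:]), float(raw))
                grouped[row["Program"]].append(entry)
    return dict(grouped)


# Hypothetical extracted rows, one per program per month.
rows = [
    {"Month": "August", "Program": "Alpha", "Day1": "5", "Day2": "3"},
    {"Month": "August", "Program": "Beta", "Day1": "2", "Day2": "7"},
    {"Month": "September", "Program": "Alpha", "Day1": "4", "Day2": "6"},
]

grouped = daily_values_by_program(rows)
program_count = len(grouped)
# Placeholder calculation: total of all daily values per program.
totals = {program: sum(value for _, _, value in entries)
          for program, entries in grouped.items()}
```

Because the grouping is keyed on the program name, it does not matter how many months or program rows the document contains.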

Thanks for your reply. There is only one table, but it has multiple subheader columns, and for each row there is day-wise data. Even if I extract the daily data separately, how will I match it to the corresponding program row?

All of this is dynamic: there can be any number of months and any number of program rows.

Can you suggest the best solution approach?

@rds0511

Can you show how the data is extracted when you extract the month as a separate value instead of indicating it as a table value?

And what is the final expected result for you…instead of having it as a column header?

Because even if you get it as a header, I believe you will still need to do manipulations to identify it and check the value when needed.

We can manipulate the extracted table to get the required output for you.

Cheers

Yes, I will attach that. I created it in the Enterprise edition.

But the problem for me is that it extracts individual parts of the table. For the program header, I gave "program" as the header and got all the outputs, and I extracted the daily data table separately.

But setting up the correlation between these tables and doing the calculations does not seem feasible to me.

program 1 ----------- validate all daily data ---------- do calculation
program 2 ----------- validate all daily data ---------- do calculation

I may be wrong about this. Can you share your inputs? All I was doing was extracting parts of a big table in chunks. The actual PDF file has multiple headers and corresponding subheaders.

Do we have any UiPath machine learning skill to extract the exact tabular format from a PDF?

Regards

@rds0511

Except for the month part, try to extract the remaining full table as one…

That should be possible by indicating the second row after the month row…

If that is not feasible or the data is not correct…then I believe the program data is present only once in each row, not across multiple rows…so we can extract both tables separately and merge the datatables together using the row index
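A minimal sketch of the row-index merge, written in Python for illustration. It assumes both extracted tables come back as lists of row dictionaries in the same top-to-bottom order as the PDF; the helper name `merge_by_row_index` and the sample values are hypothetical.

```python
def merge_by_row_index(program_rows, daily_rows):
    """Combine row i of the program table with row i of the daily table."""
    if len(program_rows) != len(daily_rows):
        # A mismatch means the two extractions do not line up row for row.
        raise ValueError(
            f"row count mismatch: {len(program_rows)} program rows, "
            f"{len(daily_rows)} daily rows"
        )
    return [{**program, **daily}
            for program, daily in zip(program_rows, daily_rows)]


# Hypothetical results of the two separate extractions.
programs = [{"Program": "Alpha"}, {"Program": "Beta"}]
daily = [{"Day1": "5", "Day2": "3"}, {"Day1": "2", "Day2": "7"}]

merged = merge_by_row_index(programs, daily)
```

The length check matters here: if one extraction drops or splits a row, a plain positional merge would silently pair a program with another program's data.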

For extracting complex tables, we don't have any built-in algorithm in ML

Hope this helps

Cheers

Thanks…let me share what I created. But the number of program rows will vary, and we can have data for multiple months as well.

@rds0511

Please show the output of what you trained and what is needed, so that we can do the manipulation accordingly…

Cheers