I am able to extract an individual table from a PDF, but I am unable to extract the complete table.
In the attached example, I can get the table that contains the day data (day 1, day 2, etc.).
But I want to extract the complete table, which includes the month, day, and program details, and get all of the data into an Excel file. It is not working. Please assist.
-Start by using the “Read PDF Text” activity in UiPath to extract the text content of the entire PDF file into a string variable.
-Once you have the PDF text in a variable, use string manipulation and regular expressions to identify the table structure in the text. Look for patterns that separate the month, day, and program data.
-After extracting the table structure, you should organize the data into a structured format, such as a data table or a list of lists, with columns for month, day, and program data.
-Use the “Write Range” activity to write the structured data to an Excel file. You can specify the Excel file and the sheet name to write the data into.
OR
Use “Read PDF Text” to extract the text content from the PDF and store it in a variable.
Use regular expressions to extract the table data.
Use the “Build Data Table” activity to create a DataTable with appropriate columns (Month, Day, Program).
Loop through the extracted matches with a “For Each” activity.
Inside the loop, use the “Add Data Row” activity to add each row of data to the DataTable.
Use the “Write Range” activity to write the DataTable to an Excel file.
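To make the parsing step concrete, here is a minimal Python sketch (in UiPath the same logic could sit in Assign/For Each activities or an Invoke Code activity). The text layout in SAMPLE_TEXT is purely an assumption for illustration: a month name on its own line, then a "Program Day1 Day2 ..." header, then one line per program. The real output of “Read PDF Text” will likely differ, so the patterns would need adjusting.

```python
import re

# Hypothetical text, standing in for what "Read PDF Text" might return.
SAMPLE_TEXT = """August
Program Day1 Day2 Day3
Program A 5 7 9
Program B 1 0 4
September
Program Day1 Day2 Day3
Program A 2 2 2
"""

MONTH_RE = re.compile(
    r"^(January|February|March|April|May|June|July|August|"
    r"September|October|November|December)$",
    re.IGNORECASE,
)
# Assumed header shape: the word "Program" followed by Day1, Day2, ...
HEADER_RE = re.compile(r"^Program((?:\s+Day\s*\d+)+)$", re.IGNORECASE)


def parse_table_text(text):
    """Return (month, program, day, value) tuples, one per table cell."""
    rows = []
    month, days = None, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if MONTH_RE.match(line):
            month = line  # a month line starts a new block
            continue
        header = HEADER_RE.match(line)
        if header:
            days = re.findall(r"Day\s*\d+", header.group(1), flags=re.IGNORECASE)
            continue
        tokens = line.split()
        if month is None or not days or len(tokens) <= len(days):
            continue  # not a data row under the assumed layout
        # The last len(days) tokens are values; the rest is the program name.
        program = " ".join(tokens[:-len(days)])
        for day, value in zip(days, tokens[-len(days):]):
            rows.append((month, program, day, value))
    return rows


rows = parse_table_text(SAMPLE_TEXT)
```

Each tuple maps to one row of a DataTable with Month, Program, Day and Value columns, which can then go to “Write Range”. Taking the last N tokens as values (N = number of day columns) keeps program names that end in a digit from being confused with data.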
Thanks @Dilli_Reddy for your reply. Actually, I have only given a sample in the attachment. The real PDF is complex,
and I am using Document Understanding (DU) for it.
So I want to extract the complete table through DU. That is the help I need here.
I am able to extract data from an individual table, but I am not able to extract the complete table data through DU in one go. For example, I can extract the day data by highlighting that part, but when I try to include August, it no longer extracts the day data.
Thanks @Anil_G for the reply. I was able to extract the August value. But the data for the complete month of August lies in day-wise columns (day 1, day 2, day 3, ... day 31). That is where I am facing issues.
Also, just like August, there will be September and October headers with their own daily data.
If I extract the August column, it does not give me the day-wise data. That is the issue.
I understand that. I believe you indicated August as a column of the table. What I am saying is to indicate it as a value instead of a column, then indicate the remaining table as a table and extract it. That way August will be appended to each row of the table you extract.
Just think of the month as a separate value and not as part of the table. Generally, if you extract something separately alongside a table, it is added to each row, which can then be converted.
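As a rough illustration of that idea in Python: assume the month comes back as a single field and the rest of the table as a list of rows. The names `attach_month`, `month` and `table_rows` are hypothetical and not part of any UiPath API; in a workflow the equivalent would be adding a Month column to the extracted DataTable and filling it in a loop.

```python
def attach_month(month, table_rows):
    """Copy the separately extracted month value onto every table row."""
    return [{"Month": month, **row} for row in table_rows]


# Hypothetical extraction results: one field value and one table.
month = "August"
table_rows = [
    {"Program": "Program 1", "Day1": 5, "Day2": 7},
    {"Program": "Program 2", "Day1": 1, "Day2": 0},
]

rows_with_month = attach_month(month, table_rows)
```

With several month blocks, the same step would run once per block and the results would be concatenated.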
I extracted the month header. But there is a relation between the program name and the daily data, and I have to do further calculations on the basis of that data.
If you look at my attachment, I first need to count how many programs there are, and then for each program I need to use its daily data to perform a calculation.
Is there any way to extract all of the tabular data in one go, in the same format as in the PDF?
I believe this is not quite possible, or even if it is possible now, it may not be recommended: you would also be providing relations through row labelling, telling the model that something is a row, and it could then get confused about which data belongs to the columns day1, day2, day3, etc.
It is best to keep the meaning of the values in sync with their labels and extract them as columns. We can then post-process the result into the required format after extraction.
So, as @Anil_G mentioned, you would need to capture the Month value of the table as a separate column (considering there may be more than one table on a single page), and keep the Program column separate as well.
If there is always only a single table on a page, then you can label the Month value as a regular field.
Thanks for your reply. There is only one table, but it has multiple subheader columns, and each row has day-wise data corresponding to it. Even if I extract the daily data separately, how will I match it to the corresponding program row?
All of this is dynamic as well: there can be any number of months and any number of program rows.
Yes, I will attach that. I created it in the Enterprise edition.
But the problem for me is that it only gets individual parts of the table. For the program header, say, I gave "program" as the header and got all the outputs, and I extracted the daily data table separately.
But setting up the correlation between these tables and then doing the calculations does not seem feasible to me.
program 1 ----------- validate all daily data ---------- do calculation
program 2 ----------- validate all daily data ---------- do calculation
I may be wrong about this, so please share your inputs. All I have been doing is extracting parts of the big table in chunks. The actual PDF file has multiple headers and corresponding subheaders.
Do we have any UiPath machine learning skill that can extract the exact tabular format from a PDF?
Except for the month part, try to extract the remaining full table as one.
That should be possible by indicating the second row, the one after the month row.
If that is not feasible or the data comes out incorrect, then I believe the program name appears exactly once in each row and does not span multiple rows, so we can extract both parts separately and merge the two datatables using the row index.
For extracting complex tables, we don't have any built-in algorithm in ML.
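A minimal Python sketch of that row-index merge, assuming the program column and the daily table were extracted separately and come back in the same row order. The function names and sample values are hypothetical, and the sum is only a placeholder for whatever calculation is actually required.

```python
def merge_by_row_index(programs, daily_rows):
    """Pair the i-th program name with the i-th row of daily values."""
    if len(programs) != len(daily_rows):
        # A mismatch means the two extractions do not line up row by row.
        raise ValueError("Program column and daily table have different row counts")
    return [{"Program": name, **days} for name, days in zip(programs, daily_rows)]


def total_per_program(merged):
    """Validate each program's daily values, then apply a placeholder sum."""
    totals = {}
    for row in merged:
        values = [v for key, v in row.items() if key != "Program"]
        if any(v is None for v in values):
            raise ValueError(f"Missing daily value for {row['Program']}")
        totals[row["Program"]] = sum(values)
    return totals


# Hypothetical results of the two separate extractions.
programs = ["Program 1", "Program 2"]
daily_rows = [{"Day1": 5, "Day2": 7}, {"Day1": 1, "Day2": 0}]

merged = merge_by_row_index(programs, daily_rows)
totals = total_per_program(merged)
```

The length check matters because the merge relies only on position: if one extraction drops or splits a row, every later program would silently be paired with the wrong daily data.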