Table data extraction from PDF documents

Hi Team,

Could anyone please help me extract the table values from the attached PDF files
and write them to an Excel file?

Regards,

Vidhan
Period 2023 06 TX-State of Texas -TSS-Base-GPS.pdf (13.1 KB)

Period 2023 06 TX-Tiers.pdf (18.5 KB)

Hi @vidhan.rpa ,

Could you also provide the expected output data? That would show us which data does not need to be captured, and so reduce the effort.

Output Format.xlsx (13.2 KB)

Please find the output format attached.

Hello @supermanPunch
I have shared the expected output format. Is there any way to extract the table data from the attached PDF?

@vidhan.rpa ,

Could you check the workflow below:
PDF_Extract_TableData.zip (52.8 KB)

The method and its explanation are similar to another post with a similar extraction requirement. You can find the post below:

  1. First, the output DataTable is created with Build Data Table, and the required column names are added using For Each and Add Data Column activities.

  2. Next, each value that applies to the whole document is extracted (Invoice Number, Invoice Date, …).

  3. Next, we filter the text down to the table data only, so that each row of the table can be extracted. We use the regex below to extract the table part:

(?<=Incentive)[\S\s]+(?<=Total Fees)

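The data between the Incentive and Total Fees keywords is extracted. The match ends right after Total Fees, so that keyword stays available as the end marker for the last row.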

  4. We then use the extracted table part to extract each row's data with the regex below:
(?=^\s+(\d+))[\S\s]+?(?=^\s+\d+|Total Fees)
  5. Next, within each row match, we identify the required values using the regex below, which has named groups for easy identification and retrieval:
\s+(?<Item>\d+)\s{2,}(?<Name>.*?)\s{2,}(\d+)\s{2,}(?<FY>\d+\s\/\s\d+)\s{2,}(?<Rate>.*?)\s{2,}(?<NOU>.*?)\s{2,}(?<ATR>.*?)\s{2,}(?<ATU>.*?)\s{2,}(?<RI>.*?)\s{2,}(?<EXP>.*?)\s{2,}(?<FEES>.*)
  6. Another assumption is that the words Deloitte, Hourly and Accounting are not needed once the first-line data is captured, so we use the regex below to replace them with empty values:
\(Deloitte|Accounting|Period|\)|\(Hourly\)|\(
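
For anyone who wants to try the regex steps outside of Studio, here is a rough Python sketch of steps 3 to 6. It is not the attached workflow, only an illustration under a few assumptions: the sample text is made up (the real PDF layout is not shown in this thread), the .NET named groups `(?<Name>…)` are rewritten as `(?P<Name>…)` because Python requires that syntax, the row regex is run in multiline mode, and the `Extra` key and the function name `extract_rows` are hypothetical.

```python
import re

# Hypothetical stand-in for the text read from the PDF; the real layout
# of the attached files is not shown in this thread.
SAMPLE_TEXT = (
    "Invoice Number: 12345\n"
    " Item  Name           Code  FY       Rate   NOU  ATR   ATU   RI    EXP   Incentive\n"
    "    1  Audit Support  100   23 / 24  50.00  10   5.00  2.00  1.00  0.50  500.00\n"
    "       (Deloitte Accounting Period)\n"
    "    2  Tax Review     200   23 / 24  75.00  4    3.00  1.00  0.00  0.25  300.00\n"
    "       Senior Staff (Hourly)\n"
    " Total Fees   800.00\n"
)

# Step 3: everything after "Incentive", up to and including "Total Fees".
TABLE_PART = re.compile(r"(?<=Incentive)[\S\s]+(?<=Total Fees)")

# Step 4: one chunk per row, starting at a line that begins with a number
# and ending before the next such line or before "Total Fees".
ROW = re.compile(r"(?=^\s+(\d+))[\S\s]+?(?=^\s+\d+|Total Fees)", re.MULTILINE)

# Step 5: named groups for the values on the first line of a row.
# Columns are assumed to be separated by two or more whitespace characters.
FIELDS = re.compile(
    r"\s+(?P<Item>\d+)\s{2,}(?P<Name>.*?)\s{2,}(\d+)\s{2,}"
    r"(?P<FY>\d+\s\/\s\d+)\s{2,}(?P<Rate>.*?)\s{2,}(?P<NOU>.*?)\s{2,}"
    r"(?P<ATR>.*?)\s{2,}(?P<ATU>.*?)\s{2,}(?P<RI>.*?)\s{2,}"
    r"(?P<EXP>.*?)\s{2,}(?P<FEES>.*)"
)

# Step 6: words and brackets that are not needed after the first line.
NOISE = re.compile(r"\(Deloitte|Accounting|Period|\)|\(Hourly\)|\(")


def extract_rows(text):
    """Return one dict per table row found in `text`."""
    table = TABLE_PART.search(text)
    if table is None:
        return []
    rows = []
    for chunk in ROW.finditer(table.group(0)):
        fields = FIELDS.search(chunk.group(0))
        if fields is None:
            # Chunks that do not fit the field pattern (for example a
            # stray newline right after the header) are skipped.
            continue
        row = fields.groupdict()
        # Assumption: whatever follows the first line is cleaned with the
        # step 6 regex and kept under a hypothetical "Extra" key.
        leftover = chunk.group(0)[fields.end():]
        row["Extra"] = " ".join(NOISE.sub("", leftover).split())
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_TEXT):
        print(row)
```

Each dictionary corresponds to one row that would be added to the output DataTable. The skip on a failed field match matters: with this sample the row regex also yields one chunk that is only a newline, so it is worth checking that a match succeeded before reading its groups.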

Check the workflow, test it with the different data samples you have, and let us know if it fails to match any of the data.

Thanks a lot, brother,

I learnt from you and you helped me a lot. Thank you again. God bless you!

Hi Sir,

Following up on the above solution: I have tested it with various samples, but I am running into problems with some other samples and need a little more help. Could you please assist?
