Table data extraction from PDF documents

@vidhan.rpa,

Could you check the workflow below:
PDF_Extract_TableData.zip (52.8 KB)

The method and its explanation are similar to those in another post that had a similar extraction requirement. You can find that post below:

  1. First, the output DataTable is created with Build Data Table, and the required column names are added using a For Each and an Add Data Column activity.

  2. Next, each value that applies to the whole document is extracted (Invoice Number, Invoice Date, …).

  3. Next, we isolate the table portion of the text so that each row can be extracted from it. We use the regex below to extract the table part:

(?<=Incentive)[\S\s]+(?<=Total Fees)

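This captures the text between the keywords Incentive and Total Fees. Because the closing condition is a lookbehind, the match ends just after Total Fees, so that keyword is included in the extracted part.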
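For anyone who wants to try the pattern outside of Studio, here is a minimal Python sketch of the table-part step. The pattern is the one given above; the function name `extract_table_part` and the sample text are made up for illustration and are not taken from the workflow.

```python
import re

# Same pattern as above; Python's `re` accepts these fixed-width lookbehinds.
TABLE_PART = re.compile(r"(?<=Incentive)[\S\s]+(?<=Total Fees)")


def extract_table_part(text):
    """Return the text after the first 'Incentive' up to and including
    the last 'Total Fees', or None when the keywords are not found."""
    match = TABLE_PART.search(text)
    return match.group(0) if match else None


# Hypothetical page text, invented for illustration only.
sample_text = (
    "Invoice Number: INV-001\n"
    "Item   Name   Incentive\n"
    "   1   Audit work   10\n"
    "   2   Tax review   20\n"
    "Total Fees   30\n"
    "Thank you\n"
)

table_part = extract_table_part(sample_text)
```

Note that `[\S\s]+` is greedy, so if Total Fees appears more than once in the text, the match runs to the last occurrence.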

  4. We then use the extracted table part to extract each row's data with the regex below. The ^ anchors need to match at the start of every line, so the Multiline regex option should be enabled:
(?=^\s+(\d+))[\S\s]+?(?=^\s+\d+|Total Fees)
  5. Next, within each row match, we identify the required values using the regex below, which has named groups so that each value is easy to identify and retrieve:
\s+(?<Item>\d+)\s{2,}(?<Name>.*?)\s{2,}(\d+)\s{2,}(?<FY>\d+\s\/\s\d+)\s{2,}(?<Rate>.*?)\s{2,}(?<NOU>.*?)\s{2,}(?<ATR>.*?)\s{2,}(?<ATU>.*?)\s{2,}(?<RI>.*?)\s{2,}(?<EXP>.*?)\s{2,}(?<FEES>.*)
  6. The other assumption is that the words Deloitte, Hourly and Accounting (and also Period, along with the surrounding parentheses) are not needed once the first-line data has been captured, so we use the regex below to replace them with an empty string:
\(Deloitte|Accounting|Period|\)|\(Hourly\)|\(
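The row-level steps can be sketched in Python roughly as follows. This is only an approximation of the workflow, not a copy of it: the patterns are the ones listed above, except that Python writes named groups as `(?P<name>...)`, and the helper names `parse_rows` and `strip_labels` plus the sample table text are hypothetical. Skipping row matches that the field pattern rejects is also an assumption of this sketch.

```python
import re

# Row splitter: same pattern as above; re.MULTILINE lets ^ match at each line start.
ROW = re.compile(r"(?=^\s+(\d+))[\S\s]+?(?=^\s+\d+|Total Fees)", re.MULTILINE)

# Field pattern: identical except that Python spells named groups (?P<name>...).
FIELDS = re.compile(
    r"\s+(?P<Item>\d+)\s{2,}(?P<Name>.*?)\s{2,}(\d+)\s{2,}(?P<FY>\d+\s\/\s\d+)"
    r"\s{2,}(?P<Rate>.*?)\s{2,}(?P<NOU>.*?)\s{2,}(?P<ATR>.*?)\s{2,}(?P<ATU>.*?)"
    r"\s{2,}(?P<RI>.*?)\s{2,}(?P<EXP>.*?)\s{2,}(?P<FEES>.*)"
)

# Cleanup pattern for the leftover labels and parentheses.
LABELS = re.compile(r"\(Deloitte|Accounting|Period|\)|\(Hourly\)|\(")


def strip_labels(text):
    """Remove the label words and parentheses, then trim the whitespace."""
    return LABELS.sub("", text).strip()


def parse_rows(table_part):
    """Split the table part into row matches and read the named groups."""
    rows = []
    for row_match in ROW.finditer(table_part):
        fields = FIELDS.search(row_match.group(0))
        if fields is None:
            # Assumption: ignore fragments (such as a leading blank line)
            # that do not look like a full table row.
            continue
        rows.append(fields.groupdict())
    return rows


# Hypothetical table part, invented for illustration only.
table_part = (
    "\n"
    "   1   Audit Services   100   2021 / 2022   50.00   10   5   2   1   0   500.00\n"
    "       (Deloitte Accounting Period)   (Hourly)\n"
    "   2   Tax Review   200   2021 / 2022   75.00   4   3   1   0   0   300.00\n"
    "Total Fees"
)

rows = parse_rows(table_part)
```

Each entry in `rows` is a dictionary keyed by the group names (Item, Name, FY, and so on), which corresponds loosely to one row added to the output DataTable. The unnamed third column is matched but not returned by `groupdict()`.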

Check the workflow and test it with the different data samples that you have, and let us know if it fails to match any of the data.