Could you check the workflow below:
PDF_Extract_TableData.zip (52.8 KB)
The methods and their explanation are similar to those in another post that had a similar extraction requirement. You can find the post below :
- Firstly, the Output Datatable is created and prepared by using Build Datatable, and the required column names are added using a For Each and an Add Data Column activity.
- Next, each of the values relating to the whole document is extracted (Invoice Number, Invoice Date, …).
- Next, we filter and keep only the table data, so that each row of the table can be extracted. We use the below regex for the table part extraction:
(?<=Incentive)[\S\s]+(?<=Total Fees)
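The data between the Incentive and Total Fees keywords is extracted. Because the pattern ends with a lookbehind, the match still includes the closing Total Fees text, which the next regex uses as the end marker for the last row.
- We then use the extracted table part for further extraction of each row's data, using the below regex: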
(?=^\s+(\d+))[\S\s]+?(?=^\s+\d+|Total Fees)
- Next, within each row match, we identify the required values using the below regex, which has named groups for easy identification and retrieval:
\s+(?<Item>\d+)\s{2,}(?<Name>.*?)\s{2,}(\d+)\s{2,}(?<FY>\d+\s\/\s\d+)\s{2,}(?<Rate>.*?)\s{2,}(?<NOU>.*?)\s{2,}(?<ATR>.*?)\s{2,}(?<ATU>.*?)\s{2,}(?<RI>.*?)\s{2,}(?<EXP>.*?)\s{2,}(?<FEES>.*)
- The other assumption is that the words Deloitte, Hourly and Accounting are not needed after the first-line data is captured, hence we use the below regex to replace them with empty values:
\(Deloitte|Accounting|Period|\)|\(Hourly\)|\(
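If you want to try the regex steps outside of the workflow, here is a rough Python sketch of the same chain (table part, then rows, then fields, then cleanup). It is not the workflow itself. The patterns are the ones above, except that Python writes named groups as `(?P<Name>...)` instead of `(?<Name>...)`. The multiline option on the row regex, the `extract_rows` helper, the `Extra` key and the sample text are all assumptions made up for illustration, and the post does not say what the workflow does with the cleaned leftover text, so the sketch simply returns it.

```python
import re

# Patterns from the steps above; only the named-group syntax is adapted
# from .NET's (?<Name>...) to Python's (?P<Name>...).
TABLE_PART = re.compile(r"(?<=Incentive)[\S\s]+(?<=Total Fees)")
# Assumption: multiline mode, so that ^ anchors at the start of every line.
ROW = re.compile(r"(?=^\s+(\d+))[\S\s]+?(?=^\s+\d+|Total Fees)", re.MULTILINE)
FIELDS = re.compile(
    r"\s+(?P<Item>\d+)\s{2,}(?P<Name>.*?)\s{2,}(\d+)\s{2,}"
    r"(?P<FY>\d+\s\/\s\d+)\s{2,}(?P<Rate>.*?)\s{2,}(?P<NOU>.*?)\s{2,}"
    r"(?P<ATR>.*?)\s{2,}(?P<ATU>.*?)\s{2,}(?P<RI>.*?)\s{2,}"
    r"(?P<EXP>.*?)\s{2,}(?P<FEES>.*)"
)
NOISE = re.compile(r"\(Deloitte|Accounting|Period|\)|\(Hourly\)|\(")


def extract_rows(text):
    """Hypothetical helper: return one dict of named groups per table row."""
    table = TABLE_PART.search(text)
    if table is None:
        return []
    rows = []
    for chunk in ROW.finditer(table.group(0)):
        fields = FIELDS.search(chunk.group(0))
        if fields is None:
            # A chunk without a data line (for example the lone newline
            # right after "Incentive") has nothing to extract.
            continue
        row = fields.groupdict()
        # Text after the first line of the row, with the unwanted words removed.
        leftover = chunk.group(0)[fields.end():]
        row["Extra"] = NOISE.sub("", leftover).strip()
        rows.append(row)
    return rows


# Made-up sample text, only to exercise the patterns; real PDFs will differ.
SAMPLE = (
    "Item   Name   Code   FY   Rate   NOU   ATR   ATU   RI   EXP   Incentive\n"
    "  1   Audit Services   100   2021 / 2022   50.00   10   5.00   2   1.00   0.00   500.00\n"
    "      (Deloitte Accounting Period)\n"
    "  2   Tax Advice   200   2021 / 2022   75.00   4   3.00   1   0.50   0.00   300.00\n"
    "      Filing (Hourly)\n"
    "Total Fees   800.00\n"
)

for row in extract_rows(SAMPLE):
    print(row["Item"], row["Name"], row["FY"], row["FEES"], repr(row["Extra"]))
```

Each dict maps directly onto the columns of the output Datatable, so a row whose fields do not match shows up as a missing entry rather than an exception, which makes it easier to spot which sample lines the patterns do not cover.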
Check the workflow, test it with the different data samples that you have, and let us know if it does not match for any of the data.