Table data extraction from PDF documents

supermanPunch · March 30, 2023, 7:01pm

Could you check the workflow below:
PDF_Extract_TableData.zip (52.8 KB)

The method and its explanation are similar to another post that had a similar extraction requirement. You can find that post below:

First, the output DataTable is created with Build Data Table, and the required column names are added using a For Each with an Add Data Column activity.
Next, each document-level value is extracted (Invoice Number, Invoice Date, …).
Next, we keep only the table portion of the text so that each row of the table can be extracted from it. We use the regex below to extract the table part:

(?<=Incentive)[\S\s]+(?<=Total Fees)

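The data between the Incentive and Total Fees keywords is extracted. Because the closing condition is a lookbehind, the match ends with the text Total Fees itself, which the row regex below relies on as a terminator.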
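To make the behaviour of this pattern concrete, here is a minimal Python sketch. It is not part of the attached workflow (which runs on .NET regex inside UiPath); the sample text and the helper name extract_table_part are invented for illustration, and the real PDF layout will differ.

```python
import re

# Pattern from the post. It also works in Python because both
# lookbehinds are fixed-width.
TABLE_PART = re.compile(r"(?<=Incentive)[\S\s]+(?<=Total Fees)")


def extract_table_part(text: str) -> str:
    """Return the text after 'Incentive' up to and including 'Total Fees'."""
    match = TABLE_PART.search(text)
    return match.group(0) if match else ""


# Hypothetical page text, invented for illustration only.
sample_text = (
    "Invoice Number: INV-001\n"
    "Item   Name              Incentive\n"
    "   1   Audit Services    100.00\n"
    "   2   Tax Advice        250.00\n"
    "Total Fees               350.00\n"
)

table_part = extract_table_part(sample_text)
print(repr(table_part))
```

Since [\S\s]+ is greedy, the match runs to the last occurrence of Total Fees in the text, so a document with several tables would need a different boundary.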

We then use the extracted table part to extract each row, using the regex below. The ^ anchors need the Multiline option so that they match at the start of every line:

(?=^\s+(\d+))[\S\s]+?(?=^\s+\d+|Total Fees)
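As a rough illustration of how this pattern splits the table part into rows, here is a Python sketch. It assumes the multiline option (re.MULTILINE here, RegexOptions.Multiline in .NET); the post does not say which options the workflow sets. The sample text and the helper name split_rows are hypothetical.

```python
import re

# Pattern from the post. re.MULTILINE lets ^ match at each line start.
ROW = re.compile(r"(?=^\s+(\d+))[\S\s]+?(?=^\s+\d+|Total Fees)", re.MULTILINE)


def split_rows(table_part: str) -> list[str]:
    """Return one string per table row, including its continuation lines."""
    return [m.group(0) for m in ROW.finditer(table_part)]


# Hypothetical table part, invented for illustration only.
table_part = (
    "   1    Audit Services    100.00\n"
    "        (Hourly)\n"
    "   2    Tax Advice        250.00\n"
    "Total Fees"
)

rows = split_rows(table_part)
for row in rows:
    print(repr(row))
```

Each match starts at a line that begins with whitespace and an item number, and stops just before the next such line or before Total Fees, so wrapped lines such as (Hourly) stay attached to their row.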

Next, within each row match, we identify the required values using the regex below, which has named groups for easy identification and retrieval:

\s+(?<Item>\d+)\s{2,}(?<Name>.*?)\s{2,}(\d+)\s{2,}(?<FY>\d+\s\/\s\d+)\s{2,}(?<Rate>.*?)\s{2,}(?<NOU>.*?)\s{2,}(?<ATR>.*?)\s{2,}(?<ATU>.*?)\s{2,}(?<RI>.*?)\s{2,}(?<EXP>.*?)\s{2,}(?<FEES>.*)
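The named groups can be read back by name, as in the Python sketch below. Python's re module writes named groups as (?P<Name>...) rather than the .NET form (?<Name>...), so the pattern is translated accordingly; everything else is unchanged. The sample row and its values are invented, and the helper name parse_row is hypothetical.

```python
import re

# Same pattern as in the post, with (?<Name>...) rewritten as (?P<Name>...).
ROW_FIELDS = re.compile(
    r"\s+(?P<Item>\d+)\s{2,}(?P<Name>.*?)\s{2,}(\d+)\s{2,}"
    r"(?P<FY>\d+\s\/\s\d+)\s{2,}(?P<Rate>.*?)\s{2,}(?P<NOU>.*?)\s{2,}"
    r"(?P<ATR>.*?)\s{2,}(?P<ATU>.*?)\s{2,}(?P<RI>.*?)\s{2,}"
    r"(?P<EXP>.*?)\s{2,}(?P<FEES>.*)"
)


def parse_row(row: str) -> dict[str, str]:
    """Return the named groups of a row, or an empty dict if it does not match."""
    match = ROW_FIELDS.search(row)
    return match.groupdict() if match else {}


# Hypothetical row: columns are separated by two or more spaces.
row = (
    "   1    Audit Services    12    2022 / 2023    150.00    10"
    "    1,500.00    0.00    0.00    25.00    1,525.00"
)

fields = parse_row(row)
print(fields)
```

The pattern depends on columns being separated by at least two whitespace characters, so a value that itself contains a double space would shift the later groups.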

Another assumption is that the words Deloitte, Hourly, and Accounting are not needed once the first-line data has been captured, so we use the regex below to replace them (along with Period and the surrounding parentheses) with empty strings:

\(Deloitte|Accounting|Period|\)|\(Hourly\)|\(
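A small Python sketch of this cleanup step is shown below, using re.sub in place of the workflow's .NET replace. The input string and the helper name remove_noise are made up for illustration.

```python
import re

# Pattern from the post: each alternative is removed wherever it occurs.
NOISE = re.compile(r"\(Deloitte|Accounting|Period|\)|\(Hourly\)|\(")


def remove_noise(text: str) -> str:
    """Replace the unwanted words and parentheses with empty strings."""
    return NOISE.sub("", text)


# Hypothetical value, invented for illustration only.
raw_name = "Audit Services (Deloitte Accounting Period) (Hourly)"

cleaned = remove_noise(raw_name).strip()
print(repr(cleaned))
```

Only the listed tokens are removed and the spaces between them are left in place, so trimming or collapsing whitespace afterwards may be needed.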

Check the workflow, test it with the different data samples you have, and let us know if it fails to match any of them.
