Trouble in Extracting the Data with Regex

Hi @adarsh_kotagiri,

Could you check the workflow below:
PDF_Extract_Tables.zip (4.7 KB)

  1. The data contains other information along with the required information, so we first remove the unwanted parts and keep only the table data, i.e., the data between the keywords Retention Amount and Payment document.
    The following regex gets only the required sections:
(?<=Retention Amount)[\S\s]+?(?=Payment document)

For the data provided, this regex returns 3 matches, meaning there are 3 sections where the data lies between Retention Amount and Payment document.
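To try the section regex outside UiPath, here is a minimal Python sketch. The sample text is made up (the actual PDF text is not part of this post), and Python's `re` module stands in for the Matches activity; the lookbehind/lookahead pattern itself is unchanged.

```python
import re

# Hypothetical sample text; the real PDF content is not shown in this post.
sample = (
    "Header text\n"
    "Retention Amount\n"
    "row A\n"
    "Payment document 1\n"
    "some other text\n"
    "Retention Amount\n"
    "row B\n"
    "Payment document 2\n"
)

# Same pattern as above: lazily capture everything after "Retention Amount"
# up to (but not including) the next "Payment document".
SECTION_RE = re.compile(r"(?<=Retention Amount)[\S\s]+?(?=Payment document)")

sections = SECTION_RE.findall(sample)
print(len(sections))  # prints 2 for this sample
```

Because the keywords sit inside lookarounds, they are not part of the match, so each section holds only the text between them (including the surrounding line breaks).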

  2. Next, we process these matched sections to remove the unwanted empty lines and dotted lines, and combine the sections with a new line between them. This is done using the expression below:
String.Join(Environment.NewLine,mc1.Cast(Of Match).Select(Function(x)x.Value.Trim.Trim("_".ToCharArray).Trim))

Here, mc1 is the output of the Matches activity where the above regex is applied.
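For reference, the same trim-and-join logic can be sketched in Python. The function name `clean_sections` and the sample strings are only illustrative, and `"\n"` is used in place of `Environment.NewLine`.

```python
def clean_sections(sections):
    """Trim whitespace, then leading/trailing underscores, then whitespace
    again from each section, and join the results with a newline.

    Mirrors x.Value.Trim.Trim("_".ToCharArray).Trim from the expression above.
    """
    return "\n".join(s.strip().strip("_").strip() for s in sections)


# Hypothetical matches, shaped like the output of the first regex.
raw_sections = ["\n____\nrow A\n____\n", "\n  row B \n"]
print(clean_sections(raw_sections))
```

Note that this only trims the start and end of each section. Blank lines or underscore lines in the middle of a section would be left in place.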

After this step, the output is a single string containing only the combined table sections.

  3. Now, to extract each row as required, we use the following pattern:
^(?!YOUR INV.NO.)(?<Loc>.+?)\s{2,}(.*?)\s{2,}(?<DocNo>\d+)\s{2,}(?<InvoiceDate>\d+\/\d+\/\d+)\s*(?<Amt>\d+\s\d+)\s*YOUR INV.NO.\s(?<InvNo>\d+)\s*(.*?)\s{2,}(\d+)\s{2,}(\d+\/\d+\/\d+)\s{2,}(?<TDS>\d+\s\d+)\s{2,}(\d+\s\d+)\s{2,}(\d+\s\d+)

  4. Using the Matches activity with the above regex gives us each row as needed, with the required information in the named groups.
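If you want to test the row pattern on its own, here is a rough Python sketch. The two sample rows are invented to fit the pattern and are not taken from the actual PDF. The only change to the pattern is the named-group syntax (`(?P<name>...)` in Python instead of `(?<name>...)` in .NET). Since the pattern starts with `^`, the sketch passes `re.MULTILINE`; in the Matches activity the equivalent would presumably be the Multiline regex option, which I am assuming here rather than quoting from the workflow.

```python
import re

# Same pattern as above, with .NET (?<name>...) rewritten as Python (?P<name>...).
ROW_RE = re.compile(
    r"^(?!YOUR INV.NO.)(?P<Loc>.+?)\s{2,}(.*?)\s{2,}(?P<DocNo>\d+)\s{2,}"
    r"(?P<InvoiceDate>\d+\/\d+\/\d+)\s*(?P<Amt>\d+\s\d+)\s*"
    r"YOUR INV.NO.\s(?P<InvNo>\d+)\s*(.*?)\s{2,}(\d+)\s{2,}(\d+\/\d+\/\d+)\s{2,}"
    r"(?P<TDS>\d+\s\d+)\s{2,}(\d+\s\d+)\s{2,}(\d+\s\d+)",
    re.MULTILINE,
)

# Hypothetical cleaned table text: each logical row spans two physical lines.
sample = "\n".join([
    "Mumbai  Invoice  5100001234  12/03/2021  10000 00",
    "YOUR INV.NO. 987  Tax deducted  5100001235  12/03/2021  200 00  0 00  9800 00",
    "Pune  Invoice  5100001300  15/03/2021  5000 00",
    "YOUR INV.NO. 988  Tax deducted  5100001301  15/03/2021  100 00  0 00  4900 00",
])

# groupdict() keeps only the named groups (Loc, DocNo, InvoiceDate, Amt, InvNo, TDS).
rows = [m.groupdict() for m in ROW_RE.finditer(sample)]
for row in rows:
    print(row)
```

One thing to keep in mind: the dots in `YOUR INV.NO.` are unescaped, so they match any character, not just a literal dot. That is harmless for this data, but `YOUR INV\.NO\.` would be stricter if other rows ever start with similar text.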

Check the workflow, test it with other data, and let us know if it does not give the proper output in all cases.
