Trouble in Extracting the Data with Regex

Hi Team,

I am trying to extract data using a regex, but it is not giving the proper output. Details are below.

Details:

Every line item starts with “MADPL SPARE” (dynamic) and ends with “YOUR INV.NO” (static).

From each line item, I need to extract a few fields: “Amount Paid”, the “TDS” amount, and “Your Invoice Number”.

So the expected output looks like this:

Expected output:

Expected_Output

I am attaching the input text file, a Word document (for your reference), and the output Excel file.

Input Text File:
MAHI.txt (5.1 KB)

word doc (For Reference):
Heighligt.docx (18.6 KB)

Excel output:
Output.xlsx (8.2 KB)

Please guide me to resolve the issue.

Thanks and regards,
Sai adarsh.

Hi @adarsh_kotagiri ,

After an initial check of the data file, we observed that the words MADPL SPARE are not present everywhere.
image

In the above image, we see that MADPL SPARE first appears on the 4th line, while there is some data before it.

What should be done in such a case? Do we only need to retrieve the data starting from that word?

Also, let us know whether the source file is a PDF or the Word document itself.

This is possible. One question: do the invoices always have the same structure?

For example, does the TDS row always have the form x xx x xx x xx?

Hi @Ninjabullen
Yes, please take the text from the text file, not from the Doc file.

Regards,
Sai adarsh.

Hi @supermanPunch

The input is a PDF, so please refer to the text file, which was extracted from the PDF. Also, the "MADPL SPARE" text is dynamic.

Regards
Adarsh.

@adarsh_kotagiri ,

In that case, could you confirm whether the PreserveFormat property was set to True in the Read PDF Text activity when extracting the data from the PDF (assuming you used the Read PDF Text activity)?

Hi @supermanPunch

Thanks for that. I had not set PreserveFormat to True. Please refer to the text file below, which was generated with PreserveFormat set to True.
MAHI.txt (6.4 KB)

Thanks and regards,
sai adarsh

Hi @Ninjabullen

Yes, the structure will be the same.

Thanks and regards.
Adarsh

Hi @adarsh_kotagiri ,

Could you check the workflow below:
PDF_Extract_Tables.zip (4.7 KB)

  1. As the data contains other information along with the required information, we can first remove the unwanted information, leaving only the table data, i.e., the data between the keywords Retention Amount and Payment document.
    Therefore, we use the regex below to get only the required data sections:
(?<=Retention Amount)[\S\s]+?(?=Payment document)

The above image shows that there are 3 matches for the data provided, meaning 3 sections where the data lies between Retention Amount and Payment document.
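For anyone who wants to try this pattern outside UiPath, here is a minimal Python sketch of the same lookbehind/lookahead extraction. The sample text is invented purely for illustration (it is not the content of MAHI.txt), and Python's `re` module stands in for the Matches activity.

```python
import re

# Hypothetical sample text; the real extracted PDF text is not reproduced here.
sample_text = """Header line
Retention Amount
_______________
ROW A
Payment document 1001
Some footer text
Retention Amount
ROW B
Payment document 1002
"""

# Same pattern as in the step above: lazily capture everything that sits
# between the two keywords, without including the keywords themselves.
SECTION_PATTERN = re.compile(r"(?<=Retention Amount)[\S\s]+?(?=Payment document)")

sections = SECTION_PATTERN.findall(sample_text)
print(len(sections))  # 2
for section in sections:
    print(repr(section))
```

The lazy quantifier `+?` is what keeps each match from running on to the last Payment document in the file, so every table section comes back as its own match.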

  2. Next, we process these matched sections further: we remove unwanted empty lines and dotted lines, and combine the sections with a new-line separator. This is done using the expression below:
String.Join(Environment.NewLine,mc1.Cast(Of Match).Select(Function(x)x.Value.Trim.Trim("_".ToCharArray).Trim))

Here, mc1 is the output of the Matches activity where the above regex is applied.

After performing the above step, the output would be as below:
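As a rough Python analogue of that expression (a sketch only, not the UiPath code itself), the same trim-and-join can be written as below. The list `raw_sections` is a made-up stand-in for the match values held in mc1, and `clean_section` is a hypothetical helper name.

```python
# Hypothetical raw matches, standing in for the values held in mc1.
raw_sections = [
    "\n_______________\nROW A\n",
    "\nROW B\n_______________\n",
]


def clean_section(value: str) -> str:
    # Mirrors x.Value.Trim.Trim("_".ToCharArray).Trim:
    # strip whitespace, then underscores, then whitespace again.
    return value.strip().strip("_").strip()


# "\n" is used here; Environment.NewLine would be "\r\n" on Windows.
combined = "\n".join(clean_section(s) for s in raw_sections)
print(combined)
```

Note that this kind of trimming only removes underscores and whitespace at the start and end of each section. A dotted line in the middle of a section would be left in place.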

  3. Now, to extract each row as required, we use the following pattern:
^(?!YOUR INV.NO.)(?<Loc>.+?)\s{2,}(.*?)\s{2,}(?<DocNo>\d+)\s{2,}(?<InvoiceDate>\d+\/\d+\/\d+)\s*(?<Amt>\d+\s\d+)\s*YOUR INV.NO.\s(?<InvNo>\d+)\s*(.*?)\s{2,}(\d+)\s{2,}(\d+\/\d+\/\d+)\s{2,}(?<TDS>\d+\s\d+)\s{2,}(\d+\s\d+)\s{2,}(\d+\s\d+)

  4. Using the Matches activity with the above regex gets us each row as needed, along with the required information (named groups).

Check the workflow, test it with other data, and let us know if it does not give the proper output in all cases.
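To experiment with the row pattern outside UiPath, a minimal Python sketch could look like the one below. Two things here are assumptions: the sample rows are invented and laid out only so that they satisfy the pattern (they are not taken from MAHI.txt), and the multiline flag is turned on so that `^` matches at the start of every line. The only change to the pattern itself is the Python spelling of named groups, `(?P<Name>...)` instead of `(?<Name>...)`.

```python
import re

# Python spelling of the row pattern: (?<Name>...) becomes (?P<Name>...).
ROW_PATTERN = re.compile(
    r"^(?!YOUR INV.NO.)(?P<Loc>.+?)\s{2,}(.*?)\s{2,}(?P<DocNo>\d+)\s{2,}"
    r"(?P<InvoiceDate>\d+\/\d+\/\d+)\s*(?P<Amt>\d+\s\d+)\s*"
    r"YOUR INV.NO.\s(?P<InvNo>\d+)\s*(.*?)\s{2,}(\d+)\s{2,}"
    r"(\d+\/\d+\/\d+)\s{2,}(?P<TDS>\d+\s\d+)\s{2,}(\d+\s\d+)\s{2,}(\d+\s\d+)",
    re.MULTILINE,  # assumption: ^ should match at the start of each line
)

# Hypothetical rows, invented for illustration only.
sample_rows = (
    "MADPL SPARE    INVOICE    5100012345    12/03/2022    10000 00\n"
    "YOUR INV.NO. 4567\n"
    "TDS    5100012346    12/03/2022    100 00    0 00    9900 00\n"
    "MADPL TOOLS    INVOICE    5100012399    15/03/2022    2500 50\n"
    "YOUR INV.NO. 4570\n"
    "TDS    5100012400    15/03/2022    25 00    0 00    2475 50\n"
)

FIELDS = ("Loc", "DocNo", "InvoiceDate", "Amt", "InvNo", "TDS")

# One dictionary per matched line item, keyed by the named groups.
rows = [
    {name: match.group(name) for name in FIELDS}
    for match in ROW_PATTERN.finditer(sample_rows)
]

for row in rows:
    print(row)
```

Each dictionary corresponds to one line item, so the Amt, TDS and InvNo values can be written straight into the output columns.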


Hi @supermanPunch

Thank you so much for taking the time to build the workflow. It is working amazingly well for all the PDFs and extracting the data perfectly, without errors.

I learned a lot from this workflow.

Thanks and Regards,
Sai adarsh.
