How to extract fields from a PDF using regex

Hi Team,

I am trying to extract fields using a regular expression. Can anyone help me do this for the sample input shown below?

The input I am using is a native PDF, as shown below.

After converting it to text, I get this text file, as shown below.
Output_Text.txt (486 Bytes)

The required output in Excel is shown below.
Output Excel.xlsx (9.1 KB)

I am not considering the Description column.
Thanks in advance.

Hi @adarsh_kotagiri

Check out these threads.

Regards
Sudharsan


Hi,

Can you try the following sample?

mc = System.Text.RegularExpressions.Regex.Matches(strData,"^(?<INVREF>\S+)\s+(?<POSTDATE>\d{2}/\d{2}/\d{4}).*?(?<INVDATE>\d{2}/\d{2}/\d{4})\s+(?<GROSSAMOUNT>[-\d.,]+)\s+(?<TDS>[-\d.,]+)\s+(?<AMOUNT>[-\d.,]+)",System.Text.RegularExpressions.RegexOptions.Multiline)

Then set the following in ArrayRow, where m is each match in mc:

{m.Groups("INVREF").Value,DateTime.ParseExact(m.Groups("POSTDATE").Value,"dd/MM/yyyy",System.Globalization.CultureInfo.InvariantCulture),DateTime.ParseExact(m.Groups("INVDATE").Value,"dd/MM/yyyy",System.Globalization.CultureInfo.InvariantCulture),m.Groups("GROSSAMOUNT").Value,m.Groups("TDS").Value,m.Groups("AMOUNT").Value}
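For anyone who wants to try the pattern outside UiPath, here is a minimal Python sketch of the same idea. It is only an illustration: Python writes named groups as `(?P<NAME>...)` where .NET uses `(?<NAME>...)`, the sample text is made up (the real Output_Text.txt is not reproduced here), and `extract_rows` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Same pattern as the .NET expression above, with Python's (?P<NAME>...) syntax.
PATTERN = re.compile(
    r"^(?P<INVREF>\S+)\s+(?P<POSTDATE>\d{2}/\d{2}/\d{4}).*?"
    r"(?P<INVDATE>\d{2}/\d{2}/\d{4})\s+(?P<GROSSAMOUNT>[-\d.,]+)\s+"
    r"(?P<TDS>[-\d.,]+)\s+(?P<AMOUNT>[-\d.,]+)",
    re.MULTILINE,  # let ^ match at the start of every line
)

# Hypothetical sample text; the layout of the real file is an assumption.
SAMPLE = """\
INV.REF POST DATE DESCRIPTION INV DATE GROSS AMOUNT TDS AMOUNT
INV-001 05/01/2023 Consulting fee 02/01/2023 1,000.00 100.00 900.00
INV-002 12/01/2023 Hosting 10/01/2023 250.50 0.00 250.50
"""


def extract_rows(text):
    """Return one tuple per matched line, skipping the Description text."""
    rows = []
    for m in PATTERN.finditer(text):
        rows.append((
            m.group("INVREF"),
            datetime.strptime(m.group("POSTDATE"), "%d/%m/%Y"),
            datetime.strptime(m.group("INVDATE"), "%d/%m/%Y"),
            m.group("GROSSAMOUNT"),
            m.group("TDS"),
            m.group("AMOUNT"),
        ))
    return rows


for row in extract_rows(SAMPLE):
    print(row)
```

The header line is skipped because its second token is not a date, and the lazy `.*?` absorbs the description between the two dates.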

Sample20230120-3L.zip (4.1 KB)

Regards,


Go through this video.

RegEx


Hi @Yoichi

Thank you so much for the workflow, it is working fine. The only challenge I am facing is that I am unable to get the multi-line data into a single row, as shown below.

The output I am getting is in this format:

Output

But my expected output is:
Expected Output

Please help me with this.

Thanks in advance.

Hi,

It seems difficult to identify which string should be appended to the INV.REF string from the previous line.

If the string to be appended always consists of numeric characters at the beginning of a line, the following will work.
(However, I suppose it may not work in every case, as there are various cases…)
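As a rough illustration of that idea (not the contents of the attached workflow), here is a Python sketch that treats any non-row line starting with digits as a continuation and appends those digits to the INV.REF of the previous row. The sample text, the `CONTINUATION` pattern, and the `merge_rows` helper are all assumptions made for this example.

```python
import re

# Row pattern from earlier in the thread, in Python's (?P<NAME>...) syntax.
ROW = re.compile(
    r"^(?P<INVREF>\S+)\s+(?P<POSTDATE>\d{2}/\d{2}/\d{4}).*?"
    r"(?P<INVDATE>\d{2}/\d{2}/\d{4})\s+(?P<GROSSAMOUNT>[-\d.,]+)\s+"
    r"(?P<TDS>[-\d.,]+)\s+(?P<AMOUNT>[-\d.,]+)"
)
# Assumption: a wrapped INV.REF continues as digits at the start of the next line.
CONTINUATION = re.compile(r"^(\d+)")

# Hypothetical sample text with one INV.REF wrapped onto a second line.
SAMPLE = """\
INV.REF POST DATE DESCRIPTION INV DATE GROSS AMOUNT TDS AMOUNT
ABC/2022- 05/01/2023 Consulting fee 02/01/2023 1,000.00 100.00 900.00
0001
XYZ-77 12/01/2023 Hosting 10/01/2023 250.50 0.00 250.50
"""


def merge_rows(text):
    """Build one dict per row, folding continuation lines into INVREF."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        row = ROW.match(line)
        if row:
            # A full row starts a new record.
            rows.append(row.groupdict())
            continue
        cont = CONTINUATION.match(line)
        if cont and rows:
            # Leading digits on a non-row line extend the previous INV.REF.
            rows[-1]["INVREF"] += cont.group(1)
    return rows


for r in merge_rows(SAMPLE):
    print(r["INVREF"], r["POSTDATE"], r["INVDATE"], r["AMOUNT"])
```

The row pattern is checked first so that a complete row whose INV.REF happens to start with digits is not mistaken for a continuation. A wrapped fragment that starts with letters would be missed, which is the limitation mentioned above.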

Sample20230120-3Lv2.zip (4.6 KB)

Regards,


Hi @adarsh_kotagiri ,

Could you try checking the post below on PDF table extraction? There, the input is the native PDF text file itself.

Let us know if this does not help in your case.


Thank you @Yoichi, it's working perfectly and I learned something new.

