Regex to extract character in pdf table

Hi All,
I need to extract the items in the “Deskripsi” and “Tipe/Kode” columns, as shown below:
[image]

I used System.Text.RegularExpressions.Regex.Match(match.Value, "(?<=^\d+\s)[A-Za-z\s.,()]+(?=\s[A-Z]+\d+)").Value for “Deskripsi”, but I only get results for rows 8, 10, 11, and 12:
8 mm tube
10 in))
11 in))
12 in))

Also, any idea for a regex that matches the various “Tipe/Kode” codes?

Thanks a lot :slight_smile:

Hi,

Are you using the Read PDF Text activity? If so, can you share the extracted text as a text file?

Regards,

Hi @Yoichi ,

Yes, I used the Read PDF Text activity. Here’s the text file:
test.txt (636 Bytes)

Hi,

In this case, I recommend using Regex.Replace and the Generate Data Table activity as follows, because once the data is in a DataTable each value is easy to handle.

strData = System.Text.RegularExpressions.Regex.Replace(strData,"(?<=(^|\r?\n)\d+)\s+|\s+(?=\w+(\r?\n|$))",chr(9))
strData = System.Text.RegularExpressions.Regex.Replace(strData,"^.*\r?\n","No."+chr(9)+"Deskripsi"+chr(9)+"Tipe / Kode"+vbCrLf)

Then build the DataTable from strData with the Generate Data Table activity.

Sample20230313-2L (2).zip (8.8 KB)
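
For readers who want to try the idea outside UiPath, here is a minimal Python sketch of the same two steps: put a tab after the leading row number and before the last word of each line, then replace the mangled header line. It is a port, not the expression from the post: Python's `re` does not support the variable-width lookbehind used above, so a capture group and the MULTILINE flag stand in for it. The sample rows and the helper names `to_tab_separated` and `parse_rows` are invented for illustration; they are not taken from test.txt.

```python
import re

HEADER = "No.\tDeskripsi\tTipe / Kode"

# Invented sample rows; the real test.txt is not reproduced in this thread.
SAMPLE = (
    "No. Deskripsi Tipe / Kode\n"
    "1 Steel pipe (2 in) AB123\n"
    "2 Valve, brass XY9\n"
)


def to_tab_separated(text: str) -> str:
    """Insert tabs around the description column and rewrite the header."""
    # Normalise Windows line endings so "$" sits right after the last word.
    text = text.replace("\r\n", "\n").strip("\n")
    # Tab after the leading row number.
    text = re.sub(r"^(\d+)[ \t]+", r"\1\t", text, flags=re.MULTILINE)
    # Tab before the last word of each line (the Tipe/Kode value).
    text = re.sub(r"[ \t]+(?=\w+$)", "\t", text, flags=re.MULTILINE)
    # The previous step also touched the header line, so replace it wholesale.
    return re.sub(r"\A.*", HEADER, text, count=1)


def parse_rows(tsv: str) -> list[list[str]]:
    """Split tab-separated text into rows of cells."""
    return [line.split("\t") for line in tsv.splitlines()]


if __name__ == "__main__":
    for row in parse_rows(to_tab_separated(SAMPLE)):
        print(row)
```

As in the original pattern, the last column is matched with `\w+`, so a code containing a hyphen or slash would not be split off; the character class would need widening for such data.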

Regards,

Hi @Yoichi, thanks a lot!
It seems to work, but I have one more question: how can I extract just the table? There are some sentences before it. Could you check this text file?
test1.txt (7.5 KB)

Hi,

In this case, it’s better to extract the necessary lines first. Can you check the following sample?

mc = System.Text.RegularExpressions.Regex.Matches(strData,"(?<=No. Deskripsi Tipe / Kode\r?\n)(\d+\s.*\n)+")

then

strData =String.Join("",mc.Cast(Of System.Text.RegularExpressions.Match).Select(Function(m) m.Value))

Sample20230313-2L (3).zip (6.4 KB)
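
The extract-then-join step can be sketched in Python as well. This is an approximation under stated assumptions: the header text is taken from the expression above, the sample document is invented, and the name `extract_table_rows` is hypothetical. Because Python's `re` rejects a lookbehind containing `\r?`, the header is matched normally and the rows are taken from a capture group instead.

```python
import re

# Header line followed by one or more lines that start with a row number.
TABLE_BLOCK = re.compile(r"No\. Deskripsi Tipe / Kode\r?\n((?:\d+\s.*\n)+)")


def extract_table_rows(text: str) -> str:
    """Return only the numbered rows that follow each table header."""
    return "".join(m.group(1) for m in TABLE_BLOCK.finditer(text))


if __name__ == "__main__":
    # Invented document: prose, a table, a footer, then a second table.
    document = (
        "Some introductory sentence.\n"
        "No. Deskripsi Tipe / Kode\n"
        "1 Steel pipe AB123\n"
        "2 Valve XY9\n"
        "Footer note\n"
        "No. Deskripsi Tipe / Kode\n"
        "3 Bolt M8\n"
    )
    print(extract_table_rows(document))
```

Each row must end with a line break to be picked up, which matches the behaviour of the expression in the post; a final row without a trailing newline would be skipped.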

Regards,

@Yoichi looks great! :slight_smile: thanks a lot
But there are still some unnecessary lines in between and at the end. Any idea?

Hi,

In my environment it works without problems, as the following image shows.

image

Did you use the same input data as above?

Regards,

Hi @Yoichi, yes, it’s the same data.

Hi,

That seems strange…
I modified the expression as follows. Can you try this?

System.Text.RegularExpressions.Regex.Matches(strData,"(?<=No. Deskripsi Tipe / Kode\r?\n)(\d+\s.*?\n)+(?=\D|$)")

Sample20230313-2L (4).zip (6.4 KB)
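
One detail of these row patterns that may be worth knowing, though the thread does not confirm it as the cause of the extra lines: `\s` also matches a line break, so a line that contains only a number (a page number, for example) lets the pattern run on into the next line. The Python sketch below shows this on invented text. The stricter variant with `[ \t]+\S` is an alternative offered here for comparison, not the fix posted above.

```python
import re

# Row pattern as used in the thread: "\s" may match the newline itself.
LOOSE = re.compile(r"(?:\d+\s.*\n)+")
# Stricter variant (an assumption): the number must be followed by spaces
# or tabs and then visible text on the same line.
STRICT = re.compile(r"(?:\d+[ \t]+\S.*\n)+")

# Invented text: two table rows, a bare number line, then unrelated text.
TEXT = "1 Steel pipe AB123\n2 Valve XY9\n3\nPage footer text\n"

if __name__ == "__main__":
    print(repr(LOOSE.match(TEXT).group(0)))   # also swallows the footer line
    print(repr(STRICT.match(TEXT).group(0)))  # stops after the second row
```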

Regards,

It’s working! thanks a lot :slight_smile:
