PDF Text Separation to Excel Row

Hi All,
I want to extract text from a PDF and write it to an Excel table, as in the picture below. The codes in “Tipe” are separated by commas, but I want each code on its own row in the Excel file, with the same “Name Produk” repeated for every “Tipe” row (as in the picture below).
One more thing: the “Tipe” codes do not always start with G. They can start with any character, such as a digit or another letter.

Please let me know the code or regex for this. I appreciate your help.

Hi,

Can you share the input PDF file?

Regards,

test.pdf (12.2 KB)
Hi @Yoichi, please check this attachment.

Hi,

The data in the PDF file is an image, not text.
Which OCR engine are you planning to use?
If possible, can you also share the text output from the OCR?
Or does your original PDF contain text?

Regards,

Hi @Yoichi, the PDF actually contains text, but I’m not able to share the full PDF since it contains sensitive information.
I’ve read the PDF text and put the result in this txt file. Hope this is useful.
text data.txt (478 Bytes)

Hi,

Can you try the following sample?

mc = System.Text.RegularExpressions.Regex.Matches(strPdf,"(\s+[A-Za-z0-9]+,)")

dt = mc.Cast(Of System.Text.RegularExpressions.Match).Select(Function(m) dt.LoadDataRow({strNo,strName,m.Value.Trim().Trim(","c)},False)).CopyToDataTable()
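For readers who want to try the idea outside UiPath, here is a minimal Python sketch of the same logic: find every whitespace-preceded alphanumeric code that is followed by a comma, trim it, and emit one row per code with the same No and Name. The function name `split_tipe_rows` and the sample text are hypothetical; the real PDF text may be laid out differently.

```python
import re

def split_tipe_rows(pdf_text, no, name):
    """Return one (no, name, tipe) tuple per comma-terminated code."""
    # Same pattern as the expression above: whitespace, alphanumerics, comma.
    matches = re.findall(r"\s+[A-Za-z0-9]+,", pdf_text)
    # Mirrors m.Value.Trim().Trim(","c): drop whitespace, then the comma.
    return [(no, name, m.strip().strip(",")) for m in matches]

# Hypothetical sample text, not the poster's actual data.
sample = "No : 1\nName Produk : Produk A\nTipe : G123, G456, 7AB9\n"
rows = split_tipe_rows(sample, "1", "Produk A")
for row in rows:
    print(row)
```

Note that in this sample the last code has no trailing comma, so this pattern does not pick it up.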

Sample20230616-7L.zip (14.4 KB)

Regards,

Hi @Yoichi, it works! I’m just wondering what the regex would look like if the “Tipe” is only “-” instead of a set of alphanumeric codes?

Hi,

How about the following?
I also fixed the previous pattern, which did not capture the last Tipe.

mc = System.Text.RegularExpressions.Regex.Matches(strPdf,"(\s+[A-Za-z0-9]+,)|(\s+[A-Za-z0-9]+\r?\n)|((?<=-|\s)-(?=\s|$))")
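The three alternatives in this pattern can be read as: a code followed by a comma, a code at the end of a line, or a standalone “-”. Below is a rough Python sketch of that behavior. It assumes the input is only the portion of the text that holds the Tipe values; on a full page, the second alternative would also match any other whitespace-preceded alphanumeric token that ends a line. The name `extract_tipe` and the sample strings are hypothetical.

```python
import re

# The same three alternatives as in the expression above.
TIPE_PATTERN = re.compile(
    r"(\s+[A-Za-z0-9]+,)"        # code followed by a comma
    r"|(\s+[A-Za-z0-9]+\r?\n)"   # last code, terminated by a line break
    r"|((?<=-|\s)-(?=\s|$))"     # a lone "-" placeholder
)

def extract_tipe(tipe_text):
    """Return the trimmed Tipe values found in tipe_text."""
    return [m.group(0).strip().strip(",") for m in TIPE_PATTERN.finditer(tipe_text)]

# Hypothetical inputs: only the text after the "Tipe" label.
codes = extract_tipe(" G123, G456, 7AB9\n")
dash_only = extract_tipe(" -\n")
print(codes)
print(dash_only)
```

With these inputs the last code is captured by the second alternative and the lone “-” by the third.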

Sample20230616-7Lv2.zip (14.4 KB)

Regards,

@Yoichi it worked to get the last Tipe. I just found out that the word “Tipe” is written twice. Any idea?

Hi,

Can you try the following? I reviewed the logic and the pattern.

Sample20230616-7Lv3.zip (14.4 KB)

Regards,


Works perfectly! Thanks @Yoichi :slight_smile:

