How to find a particular text in PDF, then extract a chunk of information?

Hey Guys!

I want to search a PDF document for a particular prefix or keyword, then extract the chunk of information related to it. I’ve attached a sample PDF, Sample.pdf (41.2 KB), that I need to work on.

It’s a kind of statement that contains several kinds of information, but I’m only interested in extracting the blocks that contain the prefix “LBC”, along with the information related to them. There can be more than one such block in a document.

The following image is an example of the information that I want to retrieve: (image on Imgur)

Thereafter, I’d like to extract this information into a spreadsheet so that I can work on it and also generate a report.

Is this possible? I’ve tried searching the forums for a possible solution but to no avail. Therefore, any help rendered will be greatly appreciated and many many thanks in advance!

Cheers!
Jeremy

Hi,

How about the following?

Main.xaml (7.0 KB)

Regards,


hey there!

Thanks for the quick reply! It seems to do the trick on the Sample PDF file that I uploaded. But when I tried it on another PDF (the actual one), it extracted a lot more information than I need.

I’ve attached an edited picture of the results I got on the actual PDF (with some sensitive data erased). Would you be able to tell why it extracted information that it shouldn’t have?

(image: Results)

Sorry if the answers to my questions are already in the regex statements and I’m just not understanding them. I’m unfamiliar with regex, and my past processes didn’t use it at all :frowning: Is there somewhere I can learn more about the syntax used in a regex statement?

Hi,

The workflow assumes each block is separated by 2 line-breaks.
We need to know another rule for separating the blocks. (For example, if SGD is always the first word in a block, we can split on it.)
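
To illustrate the difference between the two separation rules, here is a minimal Python sketch (not the UiPath workflow itself). The function names and the sample text in the usage note are hypothetical. It assumes "2 line-breaks" means one or more blank lines between blocks, and the alternative splits just before every line that starts with a given word.

```python
import re

def split_on_blank_lines(text):
    # Split wherever two or more consecutive line breaks occur.
    parts = re.split(r"(?:\r?\n){2,}", text)
    return [p.strip() for p in parts if p.strip()]

def split_on_leading_word(text, word):
    # Split just before each line that begins with `word`.
    # The lookahead is zero-width, so the word stays in its block
    # (splitting on empty matches needs Python 3.7 or later).
    pattern = r"^(?=" + re.escape(word) + r"\b)"
    parts = re.split(pattern, text, flags=re.MULTILINE)
    return [p.strip() for p in parts if p.strip()]
```

For a text such as "SGD 10.00 LBC111 first\nmore detail\nSGD 20.00 ABC222 second\n", which has no blank lines, the first function returns a single block, while split_on_leading_word(text, "SGD") returns two. That is probably the kind of over-capture seen on the real PDF.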

It’s difficult for me to recommend a good site for regex, because my native language is not English.
There are many regex sites on the internet. Please search for them with Google, etc.

Regards,

I think I roughly get your explanation. Each block does not start with SGD, however; it ends with SGD + an amount.

Each block starts with the word “Inward” though. Would you be able to show an example using the script you’ve attached above? Many many thanks!

Side note: I found a site that covers a lot of regex syntax: Regex Cheat Sheet

Hi,

it ends with SGD + an amount.

All right.
Can you try the following?

matches = System.Text.RegularExpressions.Regex.Matches(s,"[\s\S]+?SGD\s*[\d.]+\s*")
For Each item In matches (item is now of type System.Text.RegularExpressions.Match)
You can get the content using item.Value instead of item as in the previous workflow.
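
For anyone who wants to try the pattern outside UiPath, here is a rough Python sketch of the same idea. It is only an illustration, not the attached workflow: extract_blocks and blocks_to_csv are hypothetical helper names, and it assumes the PDF text has already been read into a string. It finds each block ending with SGD + an amount, keeps only the blocks containing "LBC", and writes them as CSV text that a spreadsheet can open. Note that [\d.]+ stops at a comma, so amounts with thousands separators (for example 1,234.50) would need [\d.,]+ instead.

```python
import csv
import io
import re

# Same pattern as in the post: lazily take any characters (including
# line breaks) up to "SGD", then an amount of digits and dots.
BLOCK_PATTERN = re.compile(r"[\s\S]+?SGD\s*[\d.]+\s*")

def extract_blocks(text, prefix="LBC"):
    # Each match is one block; keep only those mentioning the prefix.
    blocks = [m.group(0).strip() for m in BLOCK_PATTERN.finditer(text)]
    return [b for b in blocks if prefix in b]

def blocks_to_csv(blocks):
    # One row per block, with line breaks collapsed to single spaces.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["block"])
    for block in blocks:
        writer.writerow([" ".join(block.split())])
    return buffer.getvalue()
```

The returned CSV text can be saved to a .csv file and opened in Excel. Splitting each block into separate columns would need further rules that depend on the real statement layout.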

Regards,

I’ll give it a shot and see if it works for me! Thank you so much!