I want to search a PDF document for a particular prefix or keyword, then extract the chunk of information related to it. I’ve attached a sample PDF document, Sample.pdf (41.2 KB), that I need to work on.
It’s a kind of statement that contains several kinds of information, but I’m only interested in extracting the information that contains the prefix “LBC” and the details related to it. There could be more than one instance of this information in the document.
After that, I’d like to put the extracted information into a spreadsheet so that I can work on it and also generate a report.
Is this possible? I’ve tried searching the forums for a possible solution but to no avail. Therefore, any help rendered will be greatly appreciated and many many thanks in advance!
Thanks for the quick reply! It seems to do the trick when I tried it on the sample PDF file that I uploaded. But when I tried it on another PDF (the actual one), it extracted a lot more information than I need.
I’ve attached an edited picture of the results I got from the actual PDF (with some sensitive data erased). Would you be able to tell why it extracted information that it shouldn’t have?
Sorry if the answers to my questions are already in the regex statements and I just didn’t understand them. I’m unfamiliar with regex, and my past processes did not use it at all. Is there somewhere I can learn more about the syntax used in a regex statement?
The workflow assumes each block is separated by 2 line-breaks.
If that is not true for the actual PDF, we need to know another rule for separating the blocks. (For example, if SGD is always the first word in a block, we can use it to separate them.)
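To make the two-line-break assumption concrete, here is a minimal Python sketch of the same idea (this is not the actual workflow). The sample text and the helper name `lbc_blocks` are made up for illustration, because the real PDF layout is not shown in this thread. It also shows what probably happens when the blank lines are missing: everything ends up in one block, so unrelated entries are extracted together with the LBC ones.

```python
import re

# Hypothetical statement text; the real PDF layout is an assumption.
text = (
    "LBC 001 Payment A\nSGD 120.50\n"
    "\n"
    "XYZ 777 Other entry\nSGD 9.00\n"
    "\n"
    "LBC 002 Payment B\nSGD 75.00\n"
)

def lbc_blocks(raw):
    """Split on blank lines, then keep only blocks that mention LBC."""
    # Two or more consecutive line breaks (allowing \r\n and stray spaces)
    # are treated as the separator between blocks.
    blocks = re.split(r"(?:\r?\n[ \t]*){2,}", raw.strip())
    return [b.strip() for b in blocks if "LBC" in b]

result = lbc_blocks(text)

# Same kind of content without blank lines: the split finds no separator,
# so the whole text is returned as a single block, unrelated entry included.
no_gaps = lbc_blocks("LBC 001 A\nSGD 1.00\nXYZ 2 B\nSGD 2.00")
```

With the sample text, `result` holds only the two LBC blocks, while `no_gaps` holds one block that still contains the XYZ entry.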
It’s difficult for me to recommend a good site for regex, because my native language is not English.
There are many regex sites on the internet. Please search for them with Google, etc.
matches = System.Text.RegularExpressions.Regex.Matches(s, "[\s\S]+?SGD\s*[\d.]+\s*")
For Each item In matches (item is now of type System.Text.RegularExpressions.Match)
You can get the content using item.Value instead of item as in the previous workflow.
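Since the regex syntax was asked about, here is a rough Python sketch of what that pattern does. It assumes every block ends with "SGD" followed by an amount, and the sample text is invented. In Python, `m.group()` plays the role of `item.Value`. Filtering for "LBC" afterwards is my own addition for illustration and is not necessarily how the workflow does it.

```python
import re

# Hypothetical text with no blank lines between blocks.
text = (
    "LBC 001 Payment A\nSGD 120.50\n"
    "XYZ 777 Other entry\nSGD 9.00\n"
    "LBC 002 Payment B\nSGD 75.00\n"
)

# [\s\S]+?  any characters, including line breaks, as few as possible (lazy)
# SGD       the literal currency code that closes a block (assumed)
# \s*       optional whitespace
# [\d.]+    the amount: digits and dots
# \s*       trailing whitespace or line break
pattern = re.compile(r"[\s\S]+?SGD\s*[\d.]+\s*")

# Each match is one block; group() returns the matched text.
blocks = [m.group().strip() for m in pattern.finditer(text)]

# Keep only the blocks that mention the LBC prefix.
lbc_only = [b for b in blocks if "LBC" in b]
```

Because `+?` is lazy, each match stops at the first "SGD amount" it reaches, so the text is cut into three blocks here and `lbc_only` keeps two of them.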