PDF REGEX

Hi Guys,

I have the following question. I would like to extract everything that is between two identical keywords on the same line. The input from which the data is extracted is, however, multiline.

For example:

Dear customer,

Thank you for choosing xxx. We are happy to present to you this fillable form. xxxxxxxxxxxxxxxxxx

Name: john doe Name: Elisabeth Taylor

In this case I would like to extract the names John Doe and Elisabeth Taylor. Can someone please help me with the regex for this?

Regards,

Try this (multiline option must be on if there is more text after the second name):

(?<=Name: ).+?(?=( Name: )|$)
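To see what this pattern returns, here is a minimal sketch in Python rather than in the UiPath/.NET environment the thread is about. It assumes the pattern behaves the same way there for this simple input. The sample text is shortened and the variable names are only illustrative.

```python
import re

# Sample text from the question; the surrounding lines are shortened here.
text = (
    "Dear customer,\n"
    "\n"
    "Name: john doe Name: Elisabeth Taylor\n"
    "\n"
    "Regards,\n"
)

# Same pattern as in the reply. re.MULTILINE makes $ match at the end of each line.
pattern = re.compile(r"(?<=Name: ).+?(?=( Name: )|$)", re.MULTILINE)

# The pattern contains a capturing group, so re.findall would return the group
# contents; finditer with group(0) returns the full matches instead.
names = [m.group(0) for m in pattern.finditer(text)]
print(names)  # ['john doe', 'Elisabeth Taylor']
```

Without the multiline option, `$` would only match at the very end of the text, so the second name would not be found while more lines follow it.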


Thank you! And how would the regex go if I need both names in two separate variables?

Just use the regular expression with Matches. You will get a collection of all the matches.

name1 = matchesResult(0).ToString
name2 = matchesResult(1).ToString

Hi @Matthewvz

I think that to get an element from the matches variable, you should try:

name_1 = matches.ElementAt(0).ToString

name_2 = matches.ElementAt(1).ToString
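Both suggestions read the matches by index, which fails if fewer matches were found than expected. The following Python sketch shows the same idea with a count check. The helper `extract_names` and the empty-string fallback are assumptions made for illustration, not part of the replies above.

```python
import re

def extract_names(line):
    """Return every value that follows a 'Name: ' label in the given text."""
    pattern = re.compile(r"(?<=Name: ).+?(?=( Name: )|$)", re.MULTILINE)
    return [m.group(0) for m in pattern.finditer(line)]

matches = extract_names("Name: john doe Name: Elisabeth Taylor")

# Indexing past the end of the collection raises an error, so check the count first.
name1 = matches[0] if len(matches) > 0 else ""
name2 = matches[1] if len(matches) > 1 else ""

print(name1)  # john doe
print(name2)  # Elisabeth Taylor
```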

And what if, for example, I would like to extract only John Doe (when the name of the other person isn't filled in), or only Elisabeth Taylor (when the name John Doe isn't filled in)? What would the regex look like in those cases?

@Matthewvz - you mean like stated below…

In that case, you can check String.IsNullOrEmpty(YourRegexvariable(0).tostring). If it is true, skip writing the value…
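One caveat with the lookaround pattern: when the first field is left empty ("Name: Name: Elisabeth Taylor"), the single match it produces is "Name: Elisabeth Taylor", so an empty-string check alone would not catch it. A possible alternative, sketched here in Python, is one pattern with two capturing groups that may each be empty. It assumes both "Name:" labels are always printed on the line even when a value is missing. The names `LINE_PATTERN` and `split_names` are hypothetical.

```python
import re

# Two labels on one line; each captured value may be empty.
LINE_PATTERN = re.compile(r"Name:[ \t]*(.*?)[ \t]*Name:[ \t]*(.*)$", re.MULTILINE)

def split_names(text):
    """Return (first, second); a missing value becomes None. None if no such line exists."""
    m = LINE_PATTERN.search(text)
    if m is None:
        return None
    first, second = m.group(1).strip(), m.group(2).strip()
    return (first or None, second or None)

print(split_names("Name: john doe Name: Elisabeth Taylor"))  # ('john doe', 'Elisabeth Taylor')
print(split_names("Name: Name: Elisabeth Taylor"))           # (None, 'Elisabeth Taylor')
print(split_names("Name: john doe Name:"))                   # ('john doe', None)
```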


@Matthewvz - Please give it a try, as shown in the screenshot below…

Thanks for the swift reply again. What I mean is that I also have other fields to extract, which appear below the names:

Tax withheld: € 15000
wage tax: € 1000

How do I extract those numbers? And what if the line looks like this:

tax withheld: € 1500 do you agree (yes/no)

Thank you guys so much!

@Matthewvz - Please share the exact text, i.e. add the tax withheld lines to the original text and share one final text with us…

Please also state the requirement clearly, which would help us provide a good solution without going back and forth, because your original query did not mention any of this…

The general pattern if you want something between xx and yy is:

(?<=xx).+?(?=yy)

Just replace xx with what’s in front of what you want to capture and yy with what’s behind it.

E.g. applying the pattern

(?<=John ).+?(?= Doe)

on the text “John Hamilton Doe” will give you “Hamilton”.
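The general pattern can be wrapped in a small helper, sketched here in Python. The function names `between` and `amount_after` are hypothetical. The amount helper assumes the values are plain digits after "€", with no thousands separators or decimals, as in the sample lines from the question.

```python
import re

def between(text, before, after):
    """Return the shortest text between the literals `before` and `after`, or None."""
    # re.escape keeps characters such as ( ) or . in the markers literal.
    pattern = re.compile(f"(?<={re.escape(before)}).+?(?={re.escape(after)})")
    m = pattern.search(text)
    return m.group(0) if m else None

def amount_after(text, label):
    """Return the integer that follows 'label: €' (case-insensitive), or None."""
    m = re.search(re.escape(label) + r":\s*€\s*(\d+)", text, re.IGNORECASE)
    return int(m.group(1)) if m else None

print(between("John Hamilton Doe", "John ", " Doe"))  # Hamilton
print(between("tax withheld: € 1500 do you agree (yes/no)",
              "tax withheld: € ", " do you agree"))   # 1500

sample = "Tax withheld: € 15000\nwage tax: € 1000"
print(amount_after(sample, "tax withheld"))  # 15000
print(amount_after(sample, "wage tax"))      # 1000
```

For the amounts, a capturing group is used instead of a trailing lookahead, so the same call works whether or not extra text such as "do you agree (yes/no)" follows the number.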

See this post to learn more about regex:

