Regex email from pdf issues

Hi Guys,

I'm trying to extract the email address info@amazon.com from a PDF file using regex.
I have a few different regex patterns for pulling email addresses which work perfectly on other PDFs, but not on this one.
This pattern returns nothing: \b[\w.-]+@[\w.-]+.\w{2,4}\b

This pattern includes characters from the following page:
[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*@(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?

As you can see from the screenshot, it's pulling in extra characters from the following page.
I've attached the PDF. The email is at the bottom of one page, but the regex also pulls in the first word on the next page.
I can easily trim it off, but I'm wondering why it's happening and whether it can be solved purely by regex.

Lorem Ipsum passages, and more recently with desktop publishing software
like Aldus PageMaker including versions of Lorem Ipsum.
info@amazon.com

Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s,
Lorem Ipsum is simply dummy text of the printing and typesetting industry.

162.pdf (95.8 KB)

Screenshot 2023-01-18 172310

Thanks guys :slightly_smiling_face:

@MikeC You can use this Regular Expression

     Assign str_emailvariable = System.Text.RegularExpressions.Regex.Match(ur_variable,"(\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)").ToString

Thanks, but it's still pulling in the word from the next page. I think it's the PDF rather than the code.

@MikeC Can you share the regex code?

They're in the first post :slightly_smiling_face:

Use this regex pattern

No go, still pulls in Lorem from next page…

It's definitely the PDF. This is the text read from the PDF, and it's joining the email with the first word on the next page…

Lorem Ipsum passages, and more recently with desktop publishing software
like Aldus PageMaker including versions of Lorem Ipsum.

AMAZON

Lorem Ipsum has been the industry’s standard dummy text ever since the 1500s,
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
when an unknown printer took a galley of type
and scrambled it to make a type specimen book.
It has survived not only five centuries,
but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing
Lorem Ipsum passages, and more recently with desktop publishing software
like Aldus PageMaker including versions of Lorem Ipsum.

info@amazon.comLorem Ipsum has been the industry’s standard dummy text ever since the 1500s,
Lorem Ipsum is simply dummy text of the printing and typesetting industry.
when an unknown printer took a galley of type
and scrambled it to make a type specimen book.
It has survived not only five centuries,
but also the leap into electronic typesetting, remaining essentially unchanged.
It was popularised in the 1960s with the release of Letraset sheets containing
Lorem Ipsum passages, and more recently with desktop publishing software
like Aldus PageMaker including versions of Lorem Ipsum.
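
To make the cause concrete, here is a minimal sketch in Python's `re` module rather than UiPath/.NET (the constructs used behave the same way in both, as far as I know). It assumes the first pattern from the original post had an escaped dot (`\.`) before the TLD and that the forum stripped the backslash. Because the extracted text has no separator between `com` and `Lorem`, a pattern that demands a word boundary after the TLD finds nothing, while a pattern that accepts any run of word characters swallows `Lorem` too.

```python
import re

# Text as the PDF reader returned it: nothing separates the email from the
# first word of the next page, so the two are fused together.
joined_text = (
    "like Aldus PageMaker including versions of Lorem Ipsum.\n\n"
    "info@amazon.comLorem Ipsum has been the industry's standard dummy text"
)

# First pattern from the original post. ASSUMPTION: the dot before the TLD
# was escaped in the real workflow and the forum removed the backslash.
strict = re.compile(r"\b[\w.-]+@[\w.-]+\.\w{2,4}\b")

# Pattern suggested in the first reply (\w also matches uppercase letters).
loose = re.compile(r"(\w+([-+.]\w+)*@\w+([-.]\w+)*\.\w+([-.]\w+)*)")

strict_match = strict.search(joined_text)
loose_match = loose.search(joined_text)

# No word boundary exists within 2-4 characters after ".com", so no match.
print(strict_match)            # None
# \w+ after the dot keeps going through "Lorem".
print(loose_match.group(0))    # info@amazon.comLorem
```

So both symptoms in this thread come from the same missing separator in the extracted text, not from a bug in either pattern.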

@MikeC Try this Regex

Perfecto, thanks buddy :ok_hand:

You can also try [a-zA-Z0-9_-.]+[@][a-z]+[.][a-z]{2,3} to get the email

You can also make use of the Matches activity.

This also works, but I had to remove the minus symbol after the 0-9_:

[a-zA-Z0-9_.]+[@][a-z]+[.][a-z]{2,3}
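
For anyone wondering why this one stops at `com`: the `{2,3}` caps the TLD at three letters, so the match cannot run on into `Lorem`. A rough sketch in Python's `re` (not UiPath itself) is below. The second test string with a two-letter TLD is a made-up example, and whether the match runs case-insensitively in your workflow is an assumption that depends on how the activity is configured.

```python
import re

pattern = r"[a-zA-Z0-9_.]+[@][a-z]+[.][a-z]{2,3}"

# Fused text from this thread: the three-letter cap ends the match at "com".
joined_text = "info@amazon.comLorem Ipsum has been the industry's standard dummy text"
case_sensitive = re.search(pattern, joined_text).group(0)
ignore_case = re.search(pattern, joined_text, re.IGNORECASE).group(0)
print(case_sensitive)   # info@amazon.com
print(ignore_case)      # info@amazon.com

# Hypothetical address with a two-letter TLD fused to the next word.
short_tld_text = "sales@example.ioLorem Ipsum"
short_case_sensitive = re.search(pattern, short_tld_text).group(0)
short_ignore_case = re.search(pattern, short_tld_text, re.IGNORECASE).group(0)
print(short_case_sensitive)   # sales@example.io
print(short_ignore_case)      # sales@example.ioL  (one stray letter)
```

So the cap fixes this PDF, but it relies on the TLD being exactly three letters or on case-sensitive matching. TLDs longer than three letters would be cut short.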

Thank you :ok_hand:

The code I originally wrote has a backslash (\) before the minus. If you use it with the backslash, it will work.
I don’t know why, but the backslash got removed when I posted that reply.
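
The backslash matters because an unescaped `-` between `_` and `.` inside `[...]` is read as a character range, and `_` to `.` is a reversed range. A small sketch in Python's `re` shows the three variants. I would expect .NET to reject the unescaped form in a similar way, but that is not verified here. The `compiles` helper and the test address are made up for illustration.

```python
import re

def compiles(pattern: str) -> bool:
    """Return True if the regex engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False

unescaped = r"[a-zA-Z0-9_-.]+[@][a-z]+[.][a-z]{2,3}"   # as it appeared in the post
escaped = r"[a-zA-Z0-9_\-.]+[@][a-z]+[.][a-z]{2,3}"    # with the backslash restored
no_hyphen = r"[a-zA-Z0-9_.]+[@][a-z]+[.][a-z]{2,3}"    # minus removed entirely

print(compiles(unescaped))   # False: "_-." is a reversed range
print(compiles(escaped))     # True
print(compiles(no_hyphen))   # True

# Practical difference: only the escaped form keeps hyphens in the local part.
text = "first-name@amazon.com"
with_hyphen = re.search(escaped, text).group(0)
without_hyphen = re.search(no_hyphen, text).group(0)
print(with_hyphen)      # first-name@amazon.com
print(without_hyphen)   # name@amazon.com
```

For info@amazon.com both working variants give the same result, since the address contains no hyphen.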

Haaaa just noticed the missing slash, it actually works with or without the slash.
Thanks again :ok_hand:
