How to extract text beside a keyword from a PDF


I have this data and would like to extract the email, mobile number, etc. that appear beside their respective keywords (e.g. input: "email: xxx", output: "xxx").

Hey,
You can use the Read PDF Text activity or, if the PDF is a scanned image, the Read PDF With OCR activity. After that, you can pull out the values with string manipulation.
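
As a rough illustration of the string manipulation step, here is a minimal sketch in plain VB.NET (outside UiPath). It assumes each field sits on its own line as "Keyword: value". The helper name ValueAfter and the sample text are hypothetical stand-ins for the real Read PDF Text output.

```vbnet
Imports System
Imports Microsoft.VisualBasic

Module KeywordSplitDemo
    ' Returns the text after "keyword" on the first line that starts with it,
    ' or an empty string when no line starts with the keyword.
    Public Function ValueAfter(text As String, keyword As String) As String
        Dim lines As String() = text.Split(New String() {vbCrLf, vbLf}, StringSplitOptions.RemoveEmptyEntries)
        For Each line As String In lines
            Dim trimmed As String = line.Trim()
            If trimmed.StartsWith(keyword, StringComparison.OrdinalIgnoreCase) Then
                Return trimmed.Substring(keyword.Length).Trim()
            End If
        Next
        Return String.Empty
    End Function

    Sub Main()
        ' Hypothetical sample standing in for the Read PDF Text output.
        Dim pdfText As String = "Email: jane@example.com" & vbLf & "Mobile: 91234567"
        Console.WriteLine(ValueAfter(pdfText, "Email:"))
        Console.WriteLine(ValueAfter(pdfText, "Mobile:"))
    End Sub
End Module
```

This only works when the keyword starts a line. If the PDF text comes out differently, a regular expression is probably the easier route.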

Thanks,
Rounak

Hi,
You can follow the workflow I have attached:
image.pdf (17.1 KB)
Sample2.xaml (7.4 KB)
Thanks,
Rounak

@audrxyx

  • Convert the PDF to text using the Read PDF Text activity.
  • Apply regular expressions to that text to extract the data.
  • To get the email address, use an Assign activity with a String variable:
    Email = System.Text.RegularExpressions.Regex.Match(PDFText, "(?<=Email:).*\.com").ToString
  • To get the mobile number, use another Assign activity with a String variable:
    Mobile = System.Text.RegularExpressions.Regex.Match(PDFText, "(?<=Mobile:)\d+").ToString
  • PDFText is the output variable of the Read PDF Text activity.

If this doesn't work, please share a sample PDF.
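
To see how these two lookbehind patterns behave outside UiPath, here is a small sketch in plain VB.NET. The sample strings and the helper name FirstMatch are hypothetical, and the email pattern escapes the dot before "com". Note that (?<=Mobile:)\d+ only matches when the digits come immediately after the colon.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module LookbehindDemo
    ' Returns the first match of the pattern in the text ("" when nothing matches).
    Public Function FirstMatch(text As String, pattern As String) As String
        Return Regex.Match(text, pattern).ToString()
    End Function

    Sub Main()
        ' Hypothetical sample with no space after the colons.
        Dim pdfText As String = "Email:jane@example.com" & vbLf & "Mobile:91234567"
        Console.WriteLine(FirstMatch(pdfText, "(?<=Email:).*\.com"))
        Console.WriteLine(FirstMatch(pdfText, "(?<=Mobile:)\d+"))

        ' With a space after the colon the same mobile pattern finds nothing,
        ' because the lookbehind requires "Mobile:" directly before the digits.
        Console.WriteLine(FirstMatch("Mobile: 91234567", "(?<=Mobile:)\d+") = "")
    End Sub
End Module
```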

Hi!

The email extraction worked but the mobile number did not.

Could you also help me extract the rest of the fields? Thank you!

Raheem Mohamed Resume.pdf (64.3 KB)

Hey @audrxyx

Try this:
Variable of type String: Mobile = System.Text.RegularExpressions.Regex.Match(PDFText, "(?<=Mobile:\s*)\d+").ToString
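
Here is a minimal sketch of why the \s* inside the lookbehind helps, written as plain VB.NET outside UiPath. It also includes a generic helper for the other fields. The names ExtractMobile and FieldValue and the sample text are hypothetical, and the helper assumes each value ends at the end of its line.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module WhitespaceLookbehindDemo
    ' Digits that follow "Mobile:" and any amount of whitespace (including none).
    Public Function ExtractMobile(text As String) As String
        Return Regex.Match(text, "(?<=Mobile:\s*)\d+").ToString()
    End Function

    ' Rest of the line after "<keyword>:" and optional spaces or tabs.
    ' Assumption: one field per line, value ends at the line break.
    Public Function FieldValue(text As String, keyword As String) As String
        Dim pattern As String = "(?<=" & Regex.Escape(keyword) & ":[ \t]*)\S[^\r\n]*"
        Return Regex.Match(text, pattern).ToString()
    End Function

    Sub Main()
        Console.WriteLine(ExtractMobile("Mobile: 91234567"))
        Console.WriteLine(ExtractMobile("Mobile:91234567"))

        ' Hypothetical sample standing in for the Read PDF Text output.
        Dim pdfText As String = "Email: jane@example.com" & vbCrLf & "Mobile: 91234567"
        Console.WriteLine(FieldValue(pdfText, "Email"))
        Console.WriteLine(FieldValue(pdfText, "Mobile"))
    End Sub
End Module
```

In UiPath the same pattern strings can go straight into an Assign activity, one per field.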

You can learn Regex by checking out my Regex MegaPost

Cheers

Steve

@audrxyx Try the one below for the mobile number:

Variable of type string Mobile = System.Text.RegularExpressions.Regex.Match(PDFText, "(?<=Mobile:\s+).*").ToString

Hey! @ushu

This won't work.

If the text is all on one line, .* will match everything after the keyword instead of just the digits.
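
A quick sketch of the difference, in plain VB.NET outside UiPath. The one-line sample string and the function names are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyMatchDemo
    ' ".*" keeps going until the end of the line.
    Public Function RestOfLine(text As String) As String
        Return Regex.Match(text, "(?<=Mobile:\s+).*").ToString()
    End Function

    ' "\d+" stops at the first non-digit character.
    Public Function DigitsOnly(text As String) As String
        Return Regex.Match(text, "(?<=Mobile:\s*)\d+").ToString()
    End Function

    Sub Main()
        ' Hypothetical case where two fields ended up on the same line.
        Dim oneLine As String = "Mobile: 91234567 Email: jane@example.com"
        Console.WriteLine(RestOfLine(oneLine))
        Console.WriteLine(DigitsOnly(oneLine))
    End Sub
End Module
```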

Hey @audrxyx

Try it this way:

System.Text.RegularExpressions.Regex.Match(InputStringVariable, "(?<=Mobile:\s?)\d+").ToString

Regards,
NaNi