Need a Stable regex to get email from unorganized pdf

lucky_B · August 4, 2022, 7:47am

I have a PDF from which I need to extract a specific email address. The PDF contains data in both English and Arabic, so the text I get back from the read PDF activity looks something like this:

case 1:
"
Name مصلح مسعد عواد الجهني الاسم:
Nationality المملكة العربية َّسية: الجن
السعودية
ID No. 1346795643 َّوية: رقم اله ID Type هوية وطنية َّوية: نوع اله
Email john2010@gmail. البريد الإلكتروني: Mobile No. +966123564785 َّرقم الجوال:
com
National Address "

case 2:

"
Name
ID No. 1230645156
john@ya
Email hoo.com.u
k
National Address
فهد عبدالعزيز فهد الحكير
Nationality
َّوية: رقم اله ID Type
البريد الإلكتروني: Mobile No.
الرياض, الرياض
تاريخ الانتهاء تاريخ الاصدار "

The expected email addresses are: case 1 should be john2010@gmail.com, and case 2 should be john@yahoo.com.uk.

The text ends up structured like this because the PDF layout gets scrambled when it is converted to a string.

I'd appreciate any help using string manipulation or regex.

Yoichi · August 4, 2022, 8:27am

Hi,

For now, can you try the following? It extracts the email address from both samples above.

yourString = System.Text.RegularExpressions.Regex.Match(yourString,"(?<=ID No\.\s+\d+)\D[\s\S]*?(?=National Address)").Value

yourString = System.Text.RegularExpressions.Regex.Replace(yourString,"Mobile\s*No\.\s*[-+\d]*","")

yourString = yourString.Replace("ID Type","").Replace("Email","")

yourString = System.Text.RegularExpressions.Regex.Replace(yourString,"[\p{IsArabic}\s:]","")
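For anyone who wants to check the four steps outside Studio, here is a rough Python port. It is only a sketch, not the attached workflow: the function name `extract_email` is made up for illustration, and two things are adapted because Python's `re` module supports neither variable-length lookbehind nor `\p{IsArabic}`. A capture group replaces the lookbehind, and the explicit range U+0600 to U+06FF (the Arabic block) replaces the named block.

```python
import re

# Arabic Unicode block (U+0600-U+06FF) plus whitespace and colons.
# This range stands in for .NET's \p{IsArabic}, which Python's re lacks.
ARABIC_WS_COLON = re.compile(r"[\u0600-\u06FF\s:]")


def extract_email(text: str) -> str:
    """Hypothetical helper mirroring the four assignments above."""
    # Step 1: keep what lies between the ID number and "National Address".
    # A capture group is used in place of the .NET lookbehind.
    m = re.search(r"ID No\.\s+\d+(\D[\s\S]*?)(?=National Address)", text)
    if m is None:
        return ""  # same outcome as an empty Match.Value in .NET
    s = m.group(1)
    # Step 2: drop the mobile number label and its digits.
    s = re.sub(r"Mobile\s*No\.\s*[-+\d]*", "", s)
    # Step 3: drop the remaining English labels.
    s = s.replace("ID Type", "").replace("Email", "")
    # Step 4: drop Arabic characters, whitespace and colons,
    # which also glues the broken email fragments back together.
    return ARABIC_WS_COLON.sub("", s)


# Case 2 from the question, with the lines after "National Address" omitted.
case2 = "Name\nID No. 1230645156\njohn@ya\nEmail hoo.com.u\nk\nNational Address\n"
print(extract_email(case2))  # john@yahoo.com.uk
```

Like the original expressions, this relies on "ID No." and "National Address" bracketing the email fragments, so it shares the same limits.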

Sequence.xaml (8.4 KB)

However, it may return an incorrect address if the input string is structured differently from these samples.
It may be worth considering the Document Understanding framework instead.
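Since the result can be wrong when the layout differs, one cheap safeguard is to check that the extracted string at least has the shape of an email address before using it. The Python sketch below shows the idea; the pattern and the name `looks_like_email` are assumptions for illustration, not part of the attached workflow. In Studio, the same kind of pattern could be passed to `System.Text.RegularExpressions.Regex.IsMatch`.

```python
import re

# Deliberately loose shape check: local part, "@", one or more
# dot-separated domain labels, then an alphabetic top-level label.
EMAIL_SHAPE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)*\.[A-Za-z]{2,}"
)


def looks_like_email(candidate: str) -> bool:
    """Return True only if the whole string has an email-like shape."""
    return EMAIL_SHAPE.fullmatch(candidate) is not None
```

This only catches gross failures such as an empty result or a missing domain part. A stray label glued to the front of the address would still pass, so it does not replace testing on real samples.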

https://docs.uipath.com/document-understanding

Regards,

lucky_B · August 4, 2022, 9:20am

Thanks for the solution, @Yoichi. The regex and string manipulation worked fine on the nearly 35 samples I tested.

You're right that we should use DU; however, it would incur extra cost for the client, and we're trying to avoid that.

