Need a stable regex to extract an email address from unstructured PDF text

I have a PDF from which I need to extract a specific email address. The PDF contains data in both English and Arabic, so after the Read PDF activity the text I receive looks something like this:

case 1:
Name مصلح مسعد عواد الجهني الاسم:
Nationality المملكة العربية َّسية: الجن
ID No. 1346795643 َّوية: رقم اله ID Type هوية وطنية َّوية: نوع اله
Email john2010@gmail. البريد الإلكتروني: Mobile No. +966123564785 َّرقم الجوال:
National Address "

case 2:

ID No. 1230645156
National Address
فهد عبدالعزيز فهد الحكير
َّوية: رقم اله ID Type
البريد الإلكتروني: Mobile No.
الرياض, الرياض
تاريخ الانتهاء تاريخ الاصدار "

I have marked the email ID in bold: the case 1 email should be -, and the case 2 email should be -

The text is structured this way because the PDF layout gets scrambled when it is converted to a string.

I'd appreciate any help using string manipulation or regex.


For now, can you try the following? It extracts the email address from both samples above.

yourString = System.Text.RegularExpressions.Regex.Match(yourString,"(?<=ID No\.\s+\d+)\D[\s\S]*?(?=National Address)").Value

yourString = System.Text.RegularExpressions.Regex.Replace(yourString,"Mobile\s*No\.\s*[-+\d]*","")

yourString = yourString.Replace("ID Type","").Replace("Email","")

yourString = System.Text.RegularExpressions.Regex.Replace(yourString,"[\p{IsArabic}\s:]","")
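
To make the four steps above easier to check outside UiPath, here is a rough Python port. It is only a sketch under a few assumptions: `extract_email` is a hypothetical name, the lookbehind is replaced by a capture group because Python's `re` does not support variable-length lookbehind, and `\p{IsArabic}` is written out as the range U+0600 to U+06FF.

```python
import re

# Arabic block U+0600-U+06FF, standing in for .NET's \p{IsArabic},
# plus whitespace and colons (same character class as the last step above).
ARABIC_WS_COLON = re.compile(r"[\u0600-\u06FF\s:]")


def extract_email(text: str) -> str:
    """Hypothetical helper mirroring the four Assign steps."""
    # Step 1: keep only the text between the ID number and "National Address".
    m = re.search(r"ID No\.\s+\d+(\D[\s\S]*?)(?=National Address)", text)
    if m is None:
        return ""  # the expected anchors were not found
    s = m.group(1)
    # Step 2: drop the "Mobile No." label and the phone number after it.
    s = re.sub(r"Mobile\s*No\.\s*[-+\d]*", "", s)
    # Step 3: drop the remaining English labels.
    s = s.replace("ID Type", "").replace("Email", "")
    # Step 4: strip Arabic characters, whitespace and colons.
    return ARABIC_WS_COLON.sub("", s)
```

With a made-up input such as `"ID No. 1230645156 sam@example.org\nNational Address"`, only the address is left after the last step. Anything else that sits between the ID number and `National Address` and is not covered by steps 2 to 4 would stay in the result as well.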

Sequence.xaml (8.4 KB)

However, it might return an incorrect address if the input string is structured differently from the samples.
It may be worth considering the Document Understanding framework.
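
One way to catch inputs whose structure differs from the samples is to check that the extracted value actually looks like an email address, and to fall back to a plain search otherwise. The following Python sketch illustrates the idea; the pattern is a deliberately simple assumption (not a full RFC 5322 validator), and `looks_like_email` and `find_email_anywhere` are hypothetical names.

```python
import re

# Simple shape check: local part, "@", then a domain with at least one dot.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def looks_like_email(candidate: str) -> bool:
    """True only if the whole string has the shape of an email address."""
    return EMAIL.fullmatch(candidate) is not None


def find_email_anywhere(text: str) -> str:
    """Fallback: return the first email-shaped token in the raw text, or ""."""
    m = EMAIL.search(text)
    return m.group(0) if m else ""
```

A truncated value such as `john2010@gmail.` fails the check, so it can be flagged for manual review instead of being passed on. The fallback assumes the address is separated from neighbouring Latin text by whitespace or Arabic characters; if a label is glued directly onto it, the label would be picked up as part of the local part.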


Thanks for the solution, @Yoichi. The regex and string manipulation worked fine for the almost 35 samples I tested.

You’re right that we should use DU; however, it would incur extra cost for the client, and we’re trying to avoid that.
