Need to remove an unwanted character or replace it with the correct value when extracting data from a scanned PDF using Form Extractor

Hi,
I am having a problem extracting data from a scanned document. I have two issues.
1. I need to extract the postcode, but there is a comma (,) right after it. When I select the postcode area, the comma is captured as well, and I need to remove it. I tried removing it in the Present Validation Station, but that didn't work. How can I resolve this?

2. The second issue concerns email address extraction. Instead of @, the extractor returns “P”, and I need to replace it with @. How can I do that?

Regards,
Ekram

You can use the String method - Replace - to make the changes. The first argument is the text you want to remove and the second argument is what you want to replace it with.

  1. postcode.Replace(",", "")
  2. email.Replace("P", "@")
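
One detail that is easy to miss: Replace returns a new string and leaves the original variable untouched, so the result has to be assigned back. A minimal sketch, assuming the extracted values are already in plain String variables (the sample values here are made up for illustration):

```vb
Imports System

Module ReplaceDemo
    Sub Main()
        ' Hypothetical sample values standing in for the extracted fields.
        Dim postcode As String = "AB1 2CD,"
        Dim email As String = "martinPdef.com"

        ' Replace does not modify the original string, so keep the returned value.
        postcode = postcode.Replace(",", "").Trim()
        email = email.Replace("P", "@")

        Console.WriteLine(postcode) ' AB1 2CD
        Console.WriteLine(email)    ' martin@def.com
    End Sub
End Module
```

In a workflow, the same two assignments would go into Assign activities rather than a Sub Main.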

Hi @aman.sharma1 ,
Thank you. I have shared some screenshots so that you can see how I am extracting the data. Where can I apply the remove or replace function when the data is in a dataset?

Regards,
Ekram

One way is – after the Export Extraction Results activity, use the For Each Row activity on dataset.Tables(0). Inside the For each row, you can place this Replace logic, ensuring that you work with the correct column data.

So, for example, if the zip code is in the 3rd column of dataset.Tables(0), then inside the For Each Row you would assign row(2) = row(2).ToString.Replace(",", ""). Replace returns a new string, so the result has to be written back to the row.

But it's possible I am misunderstanding the structure of your dataset variable, in which case you will need to display the contents of dataset to figure out how it stores the zip code and email.
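
To make the loop concrete, here is a rough sketch of the same idea outside of a workflow. The table and its column names are assumptions standing in for dataset.Tables(0); the real export may be laid out differently, which is why the second loop prints every row for inspection.

```vb
Imports System
Imports System.Data

Module CleanRowsDemo
    Sub Main()
        ' Hypothetical table standing in for dataset.Tables(0);
        ' the column names are assumptions, not taken from the real export.
        Dim table As New DataTable()
        table.Columns.Add("Name", GetType(String))
        table.Columns.Add("Email", GetType(String))
        table.Columns.Add("PostCode", GetType(String))
        table.Rows.Add("Martin", "martinPdef.com", "AB1 2CD,")

        For Each row As DataRow In table.Rows
            ' Columns are zero-based, so index 2 is the third column.
            ' Write the cleaned text back, otherwise the row stays unchanged.
            row(2) = row(2).ToString().Replace(",", "").Trim()
        Next

        ' Print every cell to see how the values are actually stored.
        For Each row As DataRow In table.Rows
            Console.WriteLine(String.Join(" | ", row.ItemArray))
        Next
    End Sub
End Module
```

In UiPath, the first loop corresponds to a For Each Row activity with an Assign inside it, and the second to a Log Message or Write Line used for debugging.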

@emshihab Try the expression below in an Assign activity placed after the Export Extraction Results activity.

To remove the comma from PostCode

Input.Tables(0).Rows(0).Item("PostCode")=Input.Tables(0).Rows(0).Item("PostCode").ToString.Replace(",","").Trim
  • Input is your dataset name. This expression writes the cleaned value back to the same data table in the dataset.

To replace P with @ for email

Input.Tables(0).Rows(0).Item("Email")=Input.Tables(0).Rows(0).Item("Email").ToString.Replace("P","@").Trim

Hi @emshihab ,

We would need to know whether the email value is always lower case and whether the P that replaces @ after extraction is always a capital. If that is always the case, we can use the String Replace method.

Otherwise, we would need some more details about your extraction: which OCR engine was used, and which extractor?

Hi everyone,
@ushu the solution you provided worked.
@supermanPunch, I used the OmniPage OCR engine, and the extractor is Form Extractor.

I would like to mention one point regarding the email address. The email address is dynamic and can change, so if it contains more than one “P”, every one of them would be replaced by “@”, which is wrong.
So my idea is to replace only the 8th character from the end (the “P”) with “@”, as in martinPdef.com. “.com” is common to all the emails and “def” is the company name, which means the @ sits just before the company name. Is it possible to replace based on the character position?

@emshihab ,

Maybe you could try UiPath OCR and check whether the email is extracted properly.

If this is always going to be the pattern, then we could replace the character by index in the following way:

"martinPdef.com".Remove("martinPdef.com".Length-8,1).Insert("martinPdef.com".Length-8,"@")
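
The index-based expression above can be wrapped with a small guard so that it only fires when the character at that position really is a capital P. This is only a sketch: FixAtSign is a hypothetical helper name, and the offset of 8 assumes the domain is always "def.com" as described in the question.

```vb
Imports System

Module ReplaceAtPositionDemo
    ' Hypothetical helper: swaps the character that sits fromEnd places
    ' from the end of the value for "@", but only if it is a capital P.
    Function FixAtSign(value As String, fromEnd As Integer) As String
        Dim index As Integer = value.Length - fromEnd
        ' Leave values that are too short, or an invalid offset, untouched.
        If index < 0 OrElse index >= value.Length Then Return value
        ' Leave values that do not have "P" at that position untouched.
        If value(index) <> "P"c Then Return value
        Return value.Remove(index, 1).Insert(index, "@")
    End Function

    Sub Main()
        Console.WriteLine(FixAtSign("martinPdef.com", 8)) ' martin@def.com
        Console.WriteLine(FixAtSign("PeterPdef.com", 8))  ' Peter@def.com
        Console.WriteLine(FixAtSign("martin@def.com", 8)) ' martin@def.com (unchanged)
    End Sub
End Module
```

The guard means an address that the OCR already read correctly, or one that is shorter than expected, passes through unchanged instead of throwing or being corrupted.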

@emshihab If you want to replace the P that comes before companyname.com, try the expression below. If there is a matching P, it is replaced with @; otherwise nothing changes.

Input.Tables(0).Rows(0).Item("Email")=System.Text.RegularExpressions.regex.Replace(Input.Tables(0).Rows(0).Item("Email").ToString,"P(?=\D+\.com)","@").Trim
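
It may be worth checking the pattern against an address whose name part also contains a capital P. The lookahead in the expression above accepts any P that is followed by non-digit characters and ".com", so an earlier P can match too. The sketch below compares it with a tighter pattern; the tighter one is my own assumption and only applies if the domain is always "def.com".

```vb
Imports System
Imports System.Text.RegularExpressions

Module RegexDemo
    Sub Main()
        ' Pattern from the expression above: any "P" followed by non-digits and ".com".
        Dim loosePattern As String = "P(?=\D+\.com)"
        ' Tighter variant, assuming the domain is always "def.com".
        Dim strictPattern As String = "P(?=def\.com$)"

        ' Hypothetical sample addresses as the OCR might return them.
        For Each email As String In {"martinPdef.com", "PeterPdef.com"}
            Console.WriteLine(Regex.Replace(email, loosePattern, "@"))
            Console.WriteLine(Regex.Replace(email, strictPattern, "@"))
        Next
        ' Printed, in order:
        ' martin@def.com
        ' martin@def.com
        ' @eter@def.com
        ' Peter@def.com
    End Sub
End Module
```

If the company name varies, the strict pattern would need the list of known domains, or the position-based approach may be the safer choice.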

@ushu thanks. All of your solutions work perfectly.
@supermanPunch, I will try another OCR engine.
