Problem with extracting specific text in a pdf

Hello,

I have been trying to solve this problem for several hours but still have not found the right approach for my process. I thought you might be able to help me here on the forum.

What I want to do is open a PDF document with a lot of unstructured text. From the text I want to extract a specific code consisting of 2 letters and 6 digits. Then I'll copy the code and paste it into a web browser. I have tried using Matches / Regex but have not got it to work.

Grateful for any answers.

Hi,
Welcome to the UiPath community.
The code may have a fixed term next to it. For example, in "Invoice = INV124" the word "Invoice" sits right beside the text we need. Is there any such term in your documents?
If not, kindly share a sample of the text from which we need to extract the code.

Cheers @FekkeMalin

Hi,

Actually there are 3 PDFs that I would like to extract a code from. Each of them has a different style of unsorted text. What I would like to do is just use screen scraping on the specific code in each document so I can post it to a web browser. However, I am not able to do that.

Yes, we can do that as well:
- use Start Process and pass the PDF file path as input
- try the Screen Scraping method
- once done, we would get the text as output, and from that we can get the term we want with a regular expression or the Split method

Cheers @FekkeMalin

Thank you for the answer, but I really don't get it.

Can you please guide me on this?

First of all, I'm going to open a file called pdf1 on my computer.
From this PDF file I would like to extract a specific piece of text out of all the text in the file. The text that I would like to copy is QR322343.

After that I would like to paste that text into a search bar on a homepage.

How do I do that?

Best Regards,
Fekke

Will this QR remain the same in all texts, or will it differ, like AB12445 or OT12454?

@FekkeMalin

The QR will differ, like AB124455 or OT124545. However, the rule is always 2 letters and 6 numbers.

Fantastic.
Once you have the text from the PDF in a variable of type String named str_pdf, use this expression in an Assign activity to get the values:

list_output = System.Text.RegularExpressions.Regex.Matches(str_pdf, "[A-Z]{2}[0-9]{6}")
This expression would give you every string with two letters and six digits in the PDF,

where list_output is a variable of type System.Text.RegularExpressions.MatchCollection.

- then use a For Each loop, pass the above variable as input, and let the TypeArgument be Object in the Properties panel of the For Each loop
- inside the loop use a Write Line activity like this:
item.ToString

Or, if there will be only one such text, simply use one expression in Write Line:
System.Text.RegularExpressions.Regex.Match(str_pdf, "[A-Z]{2}[0-9]{6}").ToString

It worked for me as well.

Kindly try this and let me know if you have any queries or need clarification.
Cheers @FekkeMalin
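
The "two letters and six digits" rule can be tried out on sample text before wiring it into a workflow. Below is a minimal sketch in Python rather than UiPath; the sample text and the helper name extract_codes are made up for illustration, and only the pattern comes from the rule stated in this thread.

```python
import re

# Two uppercase letters followed by exactly six digits.
CODE_PATTERN = re.compile(r"[A-Z]{2}[0-9]{6}")


def extract_codes(text):
    """Return every substring of text that matches the code pattern, in order."""
    return CODE_PATTERN.findall(text)


# Hypothetical stand-in for the text read from the PDF (str_pdf in the thread).
sample_text = (
    "Order received. Reference QR322343 was issued.\n"
    "Related items: AB124455 and OT124545. Phone 0701234567."
)

codes = extract_codes(sample_text)
print(codes)
```

If the documents could contain longer runs such as three letters followed by digits, wrapping the pattern in word boundaries (\b on each side) would be one way to avoid partial matches.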

Hi Palaniyappan,

After your solution I wanted to check that it's working, so I put a Message Box after the Assign activity.

It says System.Text.RegularExpressions.MatchCollection and not the code. What have I done wrong? I did as you told me. For some reason I don't think that the Excel file reads clearly.

Best Regards,
Fekke
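
A likely explanation, assuming the Message Box was given the whole list_output variable converted with .ToString: a collection of matches converted to text describes the collection itself, not the items inside it. In .NET, MatchCollection.ToString returns the type name, which is the text shown above. The code has to be read from an individual match, for example list_output(0).Value, or item.ToString inside the For Each loop. The same effect can be sketched in Python, where the sample text is hypothetical:

```python
import re

CODE_PATTERN = re.compile(r"[A-Z]{2}[0-9]{6}")

# Hypothetical stand-in for the text read from the PDF.
sample_text = "Reference QR322343 was issued today."

# Converting the whole result object to text describes the object,
# not the matched codes it would yield.
matches = CODE_PATTERN.finditer(sample_text)
as_text = str(matches)

# Reading the value of each individual match gives the codes themselves.
codes = [m.group(0) for m in CODE_PATTERN.finditer(sample_text)]
first_code = codes[0] if codes else None

print(as_text)
print(first_code)
```

Checking that at least one match exists before reading the first item avoids an error when a document contains no code.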
