Find specific text in a PDF when it is not always located in the same position

Hi everyone,

I would really like to hear your approaches to the following use case.

The starting point is a PDF file on which providers stick a label that always has the same format. Of course, the data on that label differs depending on the provider, the analyzed product, the inspection batch, etc., but the label always has the same layout and format.

One piece of information on that label is a barcode, always located at the bottom of the label, which contains the key data I need. The problem is that every provider sticks the label in a different area of the PDF, which makes the barcode quite difficult to find.

How would you approach this case? Please let me know if you need further details. And no, I cannot share a sample file, as it contains confidential information.

Thanks a lot for your help!

Hi @jferre !
If the PDF is not an image, I would convert it to txt and compare several of the txt files to see whether there is a common element before and after the label. Then I would use the Substring method or Regex to fetch the data.
Otherwise, I think machine learning would suit your case (training a neural network to recognize a label by giving it as many samples as possible, then fetching the wanted data), but I don’t know how that works with UiPath, and a neural network cannot be created quickly.


@jferre - As @Hiba_B said…

  1. First, use the Read PDF Text activity with “Preserve Format” set to True and save the output in a variable named StrText

  2. Use StrText in "Matches" activity if you are using Regex or use any other string manipulation method…

If you could share a sample text file (after masking all the sensitive info) and let us know which field you would like to extract, we can take a look and help…


Hi guys, thanks for the answers.

I’m posting here a sample of the label with masked information. The PDF files are always 1-side 1-page docs. Each provider generates its own certificate in the form of a PDF file and, therefore, each pdf file is different. When we receive the files at our company, we print the PDF, we stick the label where we find a suitable blank area and we then re-scan it all together generating the final PDF file.

The goal of the script should be to locate the label in the PDF and read the barcode. The difficulty here is that in one case the label can be found in the upper right corner, in another case in the bottom left and so on.

Thanks so much for your suggestions.

[Image: Label]

Hi guys…any hint from anybody? :roll_eyes:


Hi !
Hmm, I have an idea, but I don’t know if it will work.
How many characters are there in a barcode? Is it always the same number of characters?
If so, we could use regular expressions to extract it!
Could you send us a txt sample of a PDF? You can anonymize the data.
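
As a rough illustration of that idea, here is a minimal Python sketch (not a UiPath activity). It assumes the barcode value shows up in the extracted text as a standalone run of digits with a known, fixed length; the sample text and the function name `find_fixed_length_numbers` are invented.

```python
import re
from typing import List

def find_fixed_length_numbers(text: str, length: int) -> List[str]:
    """Return every standalone run of exactly `length` digits in `text`."""
    # (?<!\d) and (?!\d) reject matches that are part of a longer number.
    pattern = rf"(?<!\d)\d{{{length}}}(?!\d)"
    return re.findall(pattern, text)

# Invented sample: one 8-digit value among shorter and longer numbers.
sample = "Cert 2021-04\nBatch 12345678\nTel 934567890123"
print(find_fixed_length_numbers(sample, 8))
```

A length-only pattern can return several candidates if the document contains other numbers of the same length, so in practice it probably needs a keyword or another check to pick the right one.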

Hi @Hiba_B ,

I investigated a bit. The barcode should contain an 8-digit number, which is the same number displayed in the field Lote Insp, also shown on the label (slightly higher up). So we could capture both the number and the barcode. The main issue is that the label is stuck on the PDF in a different position depending on the provider, which means that a fixed position reference cannot be used. I can’t think of a way to solve this. :frowning:

Would you mind sending two txt files with dummy data, with the label in a different place in each? Then we could look into it together ^^

Hi…please find 2 samples, one with the label on top and another with the label at bottom, both marked in yellow.

Thanks !
But what I need is the “.txt” extracted from the pdf that you have.
To do so, use @prasath17 's first point, then put StrText in a Write text activity, then send us the text files that have been created ^^
If you need help for the workflow to convert .pdf into .txt let me know


Hi @jferre - Read the text file and store the output as StrInput. Then use the code below.

Use StrOutput = Regex.Match(StrInput, "(?<=Lote Insp\.:\s+)\d+").Value ==> which will get the number for you.

As you can see in the image below, I used Lote Insp.: as the keyword to find the digits next to it, and it finds them in two different places…

Note: If Regex.Match is not recognized, import System.Text.RegularExpressions in the Imports tab.
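
For anyone who wants to try the keyword approach outside UiPath, here is a hedged Python sketch of the same idea. Two things are assumptions: the sample texts are invented, and a capture group replaces the lookbehind because Python's `re` module only accepts fixed-width lookbehinds (the .NET engine used above does not have that restriction).

```python
import re
from typing import Optional

# The keyword anchors the match; the capture group holds the digits after it.
LOTE_PATTERN = re.compile(r"Lote Insp\.:\s+(\d+)")

def extract_lote(text: str) -> Optional[str]:
    """Return the digits following 'Lote Insp.:', wherever they appear in the text."""
    match = LOTE_PATTERN.search(text)
    return match.group(1) if match else None

# Invented samples: the label text sits at the top in one, at the bottom in the other.
label_on_top = "Lote Insp.:   12345678\nCertificate of analysis\nResults: OK"
label_at_bottom = "Certificate of analysis\nResults: OK\nLote Insp.: 87654321\n"

print(extract_lote(label_on_top))
print(extract_lote(label_at_bottom))
```

Since the pattern is tied to the keyword and not to a page position, the same expression covers both layouts.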