I would really like to hear your approaches to the following use case.
The main input is a PDF file onto which providers stick a label that always has the same format. Of course, the data on that label differs by provider, analyzed product, inspection batch, etc., but the label always has the same layout and format.
One piece of information on that label is a barcode, always located at the bottom of the label, which contains the key data I need to get. The problem is that every provider sticks the label in a different area of the PDF file, which makes it quite difficult to find the barcode.
How would you approach this case? Please let me know if you need further details. And no, I cannot share a sample file, as there’s confidential information in it.
Hi @jferre !
If the PDF is not an image, I would convert it to txt and compare several of the txt files to see whether there is a common element before and after the label. Then I would use the Substring method or Regex to fetch the data.
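To illustrate the Substring idea, here is a minimal sketch in Python (in UiPath the same logic would be IndexOf plus Substring on the string). The marker strings `LABEL-START` and `LABEL-END` and the helper name `text_between` are made up for illustration; the real markers would be whatever text turns out to appear consistently before and after the value in the extracted txt.

```python
def text_between(text, before, after):
    """Return the text found between two marker strings, or None if either is missing."""
    start = text.find(before)
    if start == -1:
        return None  # opening marker not found
    start += len(before)
    end = text.find(after, start)
    if end == -1:
        return None  # closing marker not found after the opening one
    return text[start:end].strip()


# Hypothetical extracted text; the markers are placeholders, not real label content.
sample = "Certificate header LABEL-START 12345678 LABEL-END footer"
value = text_between(sample, "LABEL-START", "LABEL-END")
```

Because this searches the whole text for the markers, it does not depend on where the label sits on the page, only on the markers being present in the extracted text.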
Otherwise, I think machine learning would suit your case (for example, training a neural network to recognize a label by giving it as many samples as possible, then fetching the wanted data), but I don’t know how that works with UiPath, and a neural network cannot be created quickly.
First, use the Read PDF Text activity with “Preserve Format” set to True and save the output as StrText.
Then use StrText in the "Matches" activity if you are using Regex, or use any other string manipulation method…
If you could share a sample text file (after masking all the sensitive info) and let us know which field you would like to extract, we can take a look and help…
I’m posting here a sample of the label with masked information. The PDF files are always single-sided, one-page documents. Each provider generates its own certificate as a PDF file, so each PDF file is different. When we receive the files at our company, we print the PDF, stick the label wherever we find a suitable blank area, and then re-scan everything together to generate the final PDF file.
The goal of the script should be to locate the label in the PDF and read the barcode. The difficulty here is that in one case the label can be found in the upper right corner, in another case in the bottom left and so on.
Hi!
Hmm, I have an idea, but I don’t know if it will work.
How many characters are there in a barcode? Is it always the same number of characters?
If so, we could use regular expressions to extract it!
Could you send us a txt sample of a PDF? You can anonymize the data.
I investigated a bit. The barcode should contain an 8-digit number, which is the same number displayed in the Lote Insp field, also shown on the label (a bit higher up). So we could capture either one, the number or the barcode. The main issue is that the label is stuck onto the PDF in a different position depending on the provider, which means a fixed position reference cannot be used. I can’t think of a way to solve this.
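Since the number has a known length and sits next to a fixed field name, a text search does not need the label position at all. Below is a rough sketch in Python, under some assumptions that are not confirmed in this thread: that the scanned PDF has been run through OCR so text is available, that the field is printed as "Lote Insp" followed by the number (possibly with a dot, colon, spaces, or a line break in between), and that the function name `find_inspection_lot` is just a placeholder.

```python
import re

# Assumption: the label text reads "Lote Insp" followed by the 8-digit number,
# with optional punctuation or whitespace (including a line break) in between.
LOTE_PATTERN = re.compile(r"Lote\s*Insp\.?\s*:?\s*(\d{8})(?!\d)", re.IGNORECASE)

# Any run of exactly 8 digits that is not part of a longer number.
EIGHT_DIGITS = re.compile(r"(?<!\d)\d{8}(?!\d)")


def find_inspection_lot(text):
    """Return the 8-digit inspection lot found in the text, or None."""
    match = LOTE_PATTERN.search(text)
    if match:
        return match.group(1)
    # Fallback: accept a standalone 8-digit number only if it is unambiguous.
    candidates = set(EIGHT_DIGITS.findall(text))
    if len(candidates) == 1:
        return candidates.pop()
    return None


# Made-up OCR output for illustration; not taken from a real certificate.
ocr_text = "Certificate of analysis\nProvider X\nLote Insp: 12345678\nDate 2020"
lot = find_inspection_lot(ocr_text)
```

The fallback is deliberately strict: if the page contains several different 8-digit numbers and the field name was not recognized, it returns nothing rather than guessing.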
Thanks!
But what I need is the “.txt” extracted from the PDF that you have.
To do so, follow @prasath17 's first point, then pass StrText to a Write Text File activity, and send us the text files that are created ^^
If you need help with the workflow to convert .pdf into .txt, let me know.