Extract Specific Data From Multiple PDF To Excel

Hello everyone,
I am new to this, and I have a small problem I could use some help with.
I am trying to extract some information from PDF files (invoices).
The problem is with the product code on those invoices: one invoice may contain a single product, while another contains several.
I want to loop through each product, take the necessary info, and save it.
But I can't get the correct product code from the different invoices. The images show why:

The green rectangles mark the information I need, but with my code I can't find the right "anchor". For example, I tried the word "Buc" and took the first word of the next row, but as you can see in the pictures, the red rectangles indicate that sometimes there are only two rows of text there and sometimes three. I also tried anchoring the product code on the word "EAN" and taking the first word of the previous row (the row above), but that didn't work for me.
I'll include the regex as an example as well (this code is in an Assign activity):

System.Text.RegularExpressions.Regex.Match(strPDFOut, "(?s)Buc.?\n.?\n(\w+)\b").Groups(1).Value

as you can see here:

I'll attach the main file as well.

Main.xaml (15.1 KB)

Any help would be appreciated.

Thanks in advance!

Hi @Marius_M, you can use the Split function together with the line index to get your data. Instead of anchoring on a specific word, try using the line index number.

Thanks
Armila
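
To illustrate the line-index idea, here is a minimal sketch in Python (not a UiPath expression). The helper name `first_word_of_line` and the sample text are made up for illustration; a real invoice will have a different layout, and a fixed index only works when the layout never shifts.

```python
def first_word_of_line(text: str, line_index: int) -> str:
    """Return the first word of the non-empty line at line_index, or '' if out of range."""
    # Drop blank lines so the index counts only lines that carry text.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not 0 <= line_index < len(lines):
        return ""
    return lines[line_index].split()[0]


# Hypothetical extracted text, not taken from the real invoices.
sample = "Invoice 123\nABC001 Widget 2 Buc\nEAN 5940000000013\n"
code = first_word_of_line(sample, 1)
```

Because the invoices in this thread sometimes have two rows and sometimes three in the same area, a fixed index is likely to break on some files, which is why an anchor word may still be needed.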

Thanks for your reply.
I haven't worked with line indexes yet, but I'll look into it and let you know if it works.

Thanks!

I managed to find the right expression to pull the product code, which is the first word of the row above the one containing "EAN".
Now I'll have to find the right way to loop through the whole PDF and take all the product codes, but that's another problem.
I'll leave the code here for anyone who might need it:

strPDFOut.Split({Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries).TakeWhile(Function(line, index) index < Array.IndexOf(strPDFOut.Split({Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries), strPDFOut.Split({Environment.NewLine}, StringSplitOptions.RemoveEmptyEntries).FirstOrDefault(Function(l) l.Contains("EAN")))).LastOrDefault().Split(" "c).FirstOrDefault()

where strPDFOut is a string variable.
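
For the remaining problem of collecting every product code, the same "first word of the row above EAN" idea can be applied to each line that contains the anchor. Below is a rough sketch of that logic in Python (not a UiPath expression). The function name `product_codes` and the sample text are assumptions for illustration only; the text is split once and reused rather than split three times.

```python
def product_codes(text: str, anchor: str = "EAN") -> list[str]:
    """Collect the first word of the line directly above every line containing the anchor."""
    # Split once and drop blank lines, like StringSplitOptions.RemoveEmptyEntries.
    lines = [ln for ln in text.splitlines() if ln.strip()]
    codes = []
    # Start at 1 because the first line has no row above it.
    for i in range(1, len(lines)):
        if anchor in lines[i]:
            # split() with no argument ignores leading spaces on the row above.
            codes.append(lines[i - 1].split()[0])
    return codes


# Hypothetical extracted text: one product has an extra description row, the other does not.
sample = (
    "1 Widget blue 2 Buc\n"
    "long description\n"
    "ABC001 something\n"
    "EAN 111\n"
    "2 Gadget 1 Buc\n"
    "DEF002 something\n"
    "EAN 222\n"
)
all_codes = product_codes(sample)
```

Since the lookup is relative to each "EAN" line, the number of rows between "Buc" and the product code does not matter in this sketch. It does assume that "EAN" appears only on those product rows; if the word can appear elsewhere in the invoice, a stricter check would be needed.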

Topic closed.
