Extracting specific values from changing image PDF

Hello,

I am trying to build a robot that opens unstructured PDF invoices and extracts specific values, such as the invoice date. I use Google OCR to convert the image PDF to text, which works fine, but then I am stuck.

The value I want to extract always follows the term “(12)”: the date is located to the right of this term. Sometimes there is another word in between, as in this example:

How can I extract the date?
The position of the date varies from invoice to invoice, so I cannot use the location to extract the value.

Many thanks in advance.
Juka

Hi,

If you are extracting specific values from a PDF, the best method is to use regular expressions.
Use the Read PDF Text activity, then the Regex.Match method from System.Text.RegularExpressions.

Suppose your string is “(12)Date:07.02.2019”. As you said, the word “Date” is optional, so the regex below extracts the date whether or not “Date:” is present:

(?<=[(]12[)]\s*(Date\s*:)?\s*)(\d{2}\.\d{2}\.\d{4})
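
For anyone who wants to try the idea outside UiPath, here is a minimal Python sketch of the same approach. One assumption to flag: Python's `re` module does not accept variable-length lookbehinds (the .NET engine does), so the sketch swaps the lookbehind for a capture group. The function name `extract_invoice_date` and the sample strings are made up for illustration.

```python
import re

# "(12)", optional whitespace, an optional "Date" label with an optional colon,
# then a dd.mm.yyyy date captured in group 1.
# Assumption: a capture group replaces the .NET lookbehind from the post above.
DATE_AFTER_12 = re.compile(r"\(12\)\s*(?:Date\s*:?)?\s*(\d{2}\.\d{2}\.\d{4})")


def extract_invoice_date(text):
    """Return the date that follows "(12)", or None if nothing matches."""
    match = DATE_AFTER_12.search(text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Hypothetical sample strings, with and without the "Date" label.
    print(extract_invoice_date("(12)Date:07.02.2019"))
    print(extract_invoice_date("(12) 07.02.2019"))
```

Because the pattern is anchored on “(12)” rather than on a position, it should keep working when the date moves around on the page, as long as the OCR output keeps the label and the date on the same line.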

Hi @anil5,

thanks for your reply.
I am already struggling to extract the string “(12)Date:07.02.2019”. After converting the PDF to text with OCR, I get the following string:

(12) Date: 07.02.2019 (51) 000 753 999
(21) Signature
(43) Example text

How can I extract the string you mentioned?

Best regards

Hi Juka,

You can try a regex pattern similar to this one: \(12\) Date: (\d{2}\.?\d{2}\.?\d{4})
Tip: you can try them all out in online regex testers such as https://regex101.com/

Hi @MatthiasVG,

thank you for your help.
I think that's exactly what I was missing. I will try it and let you know how it works.

Please try this:
(?<=[(]12[)]\s+(Date)?:\s+)(\d{2}\.\d{2}\.\d{4})

Hi,

thanks for your help. I did a regex tutorial to understand the code, but I have a problem with the Matches activity. I would like to store the result in a variable, so I defined a String variable and entered it in the “Result” box of the Matches activity. But I get an error, as String does not seem to be the right variable type. Every other type I tried (e.g. array) did not work either.

How can I store the result of the match in a variable?

Best regards
Juka

Hi Juka,

The result of the Matches activity is stored in a

System.Collections.Generic.IEnumerable<System.Text.RegularExpressions.Match>

This is basically a collection of all the matches that your regex pattern returns. To obtain the string values, you’ll need to loop through it with a For Each. As the type, you have to select

System.Text.RegularExpressions.Match

Inside the For Each, you can then access the string of each item. I personally always capture the string I want with a regex group (i.e. the part of the pattern between brackets, ( ) ). You can use a syntax like item.Groups(x).ToString, with x being the number of the group you’re interested in.
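
To illustrate the loop-and-group idea outside UiPath, here is a rough Python analogue. It is only a sketch: `re.finditer` stands in for the Matches activity, the `for` loop for the For Each, and `item.group(1)` for the group lookup on each match. The pattern and the helper name `collect_dates` are assumptions for illustration; the sample text is the OCR output posted earlier in the thread.

```python
import re

# OCR output as posted earlier in the thread.
ocr_text = """(12) Date: 07.02.2019 (51) 000 753 999
(21) Signature
(43) Example text"""

# Assumed pattern: group 1 (the bracketed part) captures the date after "(12)".
pattern = re.compile(r"\(12\)\s*(?:Date\s*:?)?\s*(\d{2}\.\d{2}\.\d{4})")


def collect_dates(text):
    """Loop over every match and keep only the text of capture group 1."""
    dates = []
    for item in pattern.finditer(text):  # each item is a single match object
        dates.append(item.group(1))      # group 0 would be the whole match
    return dates


if __name__ == "__main__":
    print(collect_dates(ocr_text))
```

As in .NET, group 0 is the entire match and group 1 is the first bracketed part, so the result here is a list of plain strings rather than match objects. That is the same reason a String variable cannot receive the activity's result directly.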

Honestly, if the invoices are unstructured, I would use a dedicated digitization solution with UiPath, such as ABBYY FlexiCapture.