I am trying to build a robot that opens unstructured PDF invoices and extracts specific values, such as the invoice date. I use Google OCR to convert the image PDF to text, which works fine, but then I am stuck.
The value I want to extract always starts with the term "(12)", and the date is located to the right of this term. Sometimes there is also another word in between, like in this example:
If you are extracting specific values from a PDF, the best method is to use regular expressions.
Use Read PDF Text and then the Match method of System.Text.RegularExpressions.
Suppose your string is "(12)Date:07.02.2019". As you said, "Date" is optional, so if you remove "Date" from the string, you will still be able to extract the date using the regex below.
thanks for your reply.
I am already struggling to extract the string "(12)Date:07.02.2019". After converting the PDF to text with OCR, I get the following string:
(12) Date: 07.02.2019 (51) 000 753 999
(21) Signature
(43) Example text
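You can try a regex pattern similar to this one (the parentheses around 12 are escaped so that they match literally, and the dots are escaped so that they only match a dot): \(12\) Date: (\d{2}\.?\d{2}\.?\d{4})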
Tip: you can try them all out in online regex testers such as https://regex101.com/
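If you would rather check a pattern in a small script than in an online tester, here is a minimal Python sketch run against the OCR text posted above. It is only an illustration, not the UiPath setup itself: the helper name extract_date and the optional-word part of the pattern are my own assumptions, added to cover the case where a word such as "Date:" may or may not appear between "(12)" and the date.

```python
import re

# OCR output as posted earlier in this thread
ocr_text = """(12) Date: 07.02.2019 (51) 000 753 999
(21) Signature
(43) Example text"""

# \(12\) matches the literal "(12)".
# The optional non-capturing group allows one word such as "Date:" in between.
# Group 1 captures the date in dd.mm.yyyy form.
pattern = re.compile(r"\(12\)\s*(?:[A-Za-z]+:?\s*)?(\d{2}\.\d{2}\.\d{4})")


def extract_date(text):
    """Return the date that follows "(12)", or None if nothing matches.

    Hypothetical helper for illustration only.
    """
    match = pattern.search(text)
    return match.group(1) if match else None


print(extract_date(ocr_text))          # with "Date:" in between
print(extract_date("(12)07.02.2019"))  # without any word in between
```

The same pattern string should also work in the .NET regex engine, since it only uses escapes, character classes and groups that both engines share.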
thanks for your help. I did a regex tutorial to understand the code, but I have a problem with the Matches activity. I would like to store the result in a variable, so I defined a String variable and entered it in the "Result" box of the Matches activity. But I get an error, as String does not seem to be the right variable type. Every other type I tried did not work either (e.g. Array).
How can I store the result of the match in a variable?
The result of the Matches activity is basically a collection of all the matches that your regex pattern returns. To obtain the string values, you'll need to loop through it with a For Each. As the type you have to select
System.Text.RegularExpressions.Match
Inside the For Each, you can then access the string of each item. I personally always capture the string I want with a regex group (i.e. between brackets "( )"). You can use a syntax like item.Groups(%x%).ToString, with %x% being the number of the group you're interested in.
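To make the idea of looping over a collection of matches more concrete, here is a rough Python analogue. It is a sketch only: re.finditer plays the role of the Matches activity, the for loop plays the role of the For Each, and item.group(1) corresponds to item.Groups(1).ToString. The two-line sample text is made up so that there is more than one match to loop over.

```python
import re

# Hypothetical sample text with two "(12)" entries (not from the real invoice)
text = (
    "(12) Date: 07.02.2019 (51) 000 753 999\n"
    "(12) Date: 15.03.2019 (51) 000 754 000"
)

# Group 1 (the part between the unescaped brackets) captures the date
pattern = re.compile(r"\(12\) Date: (\d{2}\.\d{2}\.\d{4})")

dates = []
for item in pattern.finditer(text):
    # Each item is one match object; group(1) is the captured date as a string
    dates.append(item.group(1))

print(dates)
```

If you only ever expect one date per document, taking the first element of the collection would be enough, but the loop also covers the case where the pattern matches more than once.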