Help with RegExp

P_S · September 12, 2019, 2:35pm

Hi all,

I have read in the text from a PDF and a part of the content looks like this:

Belopp 
Konto Ansvar Projekt Verks Produkt Objekt Moms exkl moms

6419 94205 1 40700 4220 209,30 837,20

209,30
1677 SUMMA MOMS

1.046,50
2510 ATT UTBETALA
CAN BE MORE TEXT/NUMBERS HERE

I would want to extract the value “209,30” before “SUMMA MOMS”
and also the value “1.046,50” before “ATT UTBETALA”.

All the numbers above can be different each time.

How can I get this done in the easiest way?

supermanPunch · September 12, 2019, 2:48pm

@P_S Can you give us more data like this?

P_S · September 12, 2019, 2:54pm

All the numbers in it can be different, but all the text is always the same.
Here’s another example:

Belopp 
Konto Ansvar Projekt Verks Produkt Objekt Moms exkl moms

5439 54205 4 43210 5210 310,45 1957,30

310,45
2677 SUMMA MOMS

22.356,89
3520 ATT UTBETALA

CAN BE MORE TEXT/NUMBERS HERE

So here I would want to have the number on the row above “SUMMA MOMS”, which is “310,45”
and the number on the row above “ATT UTBETALA” = “22.356,89”

supermanPunch · September 12, 2019, 3:10pm

@P_S Ohh if SUMMA MOMS and ATT UTBETALA are constant then we can retrieve those values

Dave · September 12, 2019, 3:18pm

@p_s This regex should find the number (digits, commas and periods) one line above the line containing SUMMA MOMS:
[0-9,.]+(?=\r\n.*SUMMA MOMS)

And this regex does the same, but looks at the line above ATT UTBETALA:
[0-9,.]+(?=\r\n.*ATT UTBETALA)
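
For anyone who wants to try the lookahead idea outside UiPath, here is a minimal sketch in Python's `re` module rather than .NET. The helper name `number_above` is made up for illustration, the sample is the text from the first post joined with `\r\n` (an assumption about how the PDF text arrives), and `\r?\n` is used so that plain `\n` line endings would also work.

```python
import re

# Sample text from the first post, joined with Windows line endings
# (an assumption about what the PDF reader produces).
SAMPLE = "\r\n".join([
    "6419 94205 1 40700 4220 209,30 837,20",
    "",
    "209,30",
    "1677 SUMMA MOMS",
    "",
    "1.046,50",
    "2510 ATT UTBETALA",
])


def number_above(label, text):
    """Return the number that ends the line directly above the line containing label."""
    # The lookahead checks the next line for the label without consuming it,
    # so the match itself is only the run of digits, commas and periods.
    pattern = r"[0-9,.]+(?=\r?\n.*" + re.escape(label) + ")"
    match = re.search(pattern, text)
    return match.group(0) if match else None


moms = number_above("SUMMA MOMS", SAMPLE)
utbetala = number_above("ATT UTBETALA", SAMPLE)
```

With the sample above, `moms` holds the value before SUMMA MOMS and `utbetala` the value before ATT UTBETALA.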

Amine_Tazlaft · September 12, 2019, 3:38pm

You can use this Regex (\d+,\d+)(?:\n\d+ SUMMA MOMS) to get the first number, and this one (\d+,\d+)(?:\n\d+ ATT UTBETALA) for the second number.

You can use this editor : https://regex101.com to test the Regex before using it in UiPath.

supermanPunch · September 12, 2019, 3:54pm

@P_S I have a slightly more complicated regex, but I think it works too, if you want to try it.
Regex for SUMMA MOMS:
((\d{4,})|((\d+(,|.))+\d{2,}))\s*(?:\n\d+\sSUMMA\sMOMS)

Regex for ATT UTBETALA:
((\d{4,})|((\d+(,|.))+\d{2,}))\s*(?:\n\d+\sATT\sUTBETALA)

You can use the Matches activity with this regex and take the 3rd group. For further help, just ask.

Dave · September 12, 2019, 4:02pm

I believe @Amine_Tazlaft is using a different ‘flavor’ of regex so the syntax is slightly wrong for vb.net. I think you just need to change your positive lookahead from ?: to ?= is all though. However, the (\d+,\d+) portion does not work if there are no commas, and it doesn’t work if it includes a period. The solution [0-9,.]+ is able to handle all number formats without issue, although it will grab invalid numbers if it contains more than one decimal identifier.

Also, for the lookahead you’ll need to include carriage return \r in addition to newline \n. For vb.net specific regex testing I highly recommend using .NET Regex Tester - Regex Storm instead of regex101.com

@supermanPunch this solution is working (although you also have to update the syntax for positive lookahead), but it seems to add complexity where it isn’t needed with the multiple groupings and inline ifs. Is there any advantage to it over using [0-9,.]+ that I am missing?
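
The difference between a `(?:...)` group and a `(?=...)` lookahead, and between `\d+,\d+` and `[0-9,.]+`, can be seen in a small experiment. This is a sketch in Python's `re` module, not in .NET, and the variable names are only illustrative. With a non-capturing group the following line becomes part of the overall match and the number has to be read from group 1, while with a lookahead the overall match is the number alone.

```python
import re

# Two of the lines from the first post, with Windows line endings (assumed).
TEXT = "209,30\r\n1677 SUMMA MOMS\r\n\r\n1.046,50\r\n2510 ATT UTBETALA"

# Non-capturing group: the SUMMA MOMS line is consumed into the overall match,
# so the number alone is only available as group 1.
grouped = re.search(r"(\d+,\d+)(?:\r?\n\d+ SUMMA MOMS)", TEXT)
whole_match = grouped.group(0)
first_group = grouped.group(1)

# Lookahead: the SUMMA MOMS line is checked but not consumed,
# so the overall match is just the number.
lookahead = re.search(r"\d+,\d+(?=\r?\n\d+ SUMMA MOMS)", TEXT).group(0)

# \d+,\d+ cannot cross the period used as thousands separator,
# whereas [0-9,.]+ takes the whole number.
narrow = re.search(r"\d+,\d+(?=\r?\n\d+ ATT UTBETALA)", TEXT).group(0)
wide = re.search(r"[0-9,.]+(?=\r?\n\d+ ATT UTBETALA)", TEXT).group(0)
```

Here `narrow` ends up with only the part of 1.046,50 after the period, which is the limitation of `\d+,\d+` described above.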

supermanPunch · September 12, 2019, 4:14pm

@Dave I believe in your case this should have been the regex
([0-9,.]+)\n([0-9,.]+)\s*(ATT\sUTBETALA)

P_S · September 12, 2019, 4:28pm

Thanks @Dave, this seems to work fine!

Can you also give me advice on the following:

FAKTURA
Fakturanr 3732 Kundnr 288

TEST ABC
Fakturadatum 2019-08-29

Here I want to have the value after “Fakturanr”, that is only the value “3732”.

And also the following case:

UTBETALNING

Fakturadatum Förfallodatum Ref nr Fakturanummer

2019-09-05 2019-09-12 190905CS

Betalningsmottagare

Where I actually want the value of “Fakturanummer”… however, that can be found on the row below and has the value “190905CS”.

Many thanks in advance!

supermanPunch · September 12, 2019, 4:38pm

@P_S Can you provide more data like this? It is always needed to confirm whether the regex works properly.

Dave · September 12, 2019, 4:41pm

Sure thing!

The first one is simpler and is searching for the word “Fakturanr” along with a single blank space, then is grabbing all of the digits that occur after it. The regex expression to do this is: (?<=Fakturanr )\d+

The next one is a little more complicated. I will make 4 assumptions as follows, please let me know if any assumptions could be incorrect:

1. The value you want will always be 1 or 2 lines below “Fakturanummer”.
2. “Fakturanummer” is always at the end of the line.
3. The value you want is always at the end of the line.
4. The value you want does not contain blank spaces anywhere within it.

Based on those 4 assumptions, you could do it in a single regex, but I would instead break it into 2 regex statements. The first would pull the entire line containing the value that you want, and the 2nd regex would grab just the value you want.

Regex 1 (to pull out the entire line) is:
(?<=Fakturanummer(\r\n){1,2})\S.+

Then to pull out just the last portion, regex 2 would be:
\S+$
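
The two-step approach can be sketched in Python as follows. Two things differ from the .NET expressions and are assumptions of this sketch: Python's `re` does not accept a variable-length lookbehind such as `(?<=Fakturanummer(\r\n){1,2})`, so a capturing group is swapped in for step 1, and `[^\r\n]*` is used instead of `.+` so that a trailing carriage return does not end up in the captured line. The function name is made up for illustration.

```python
import re

# The UTBETALNING sample from the post, joined with Windows line endings (assumed).
SAMPLE = "\r\n".join([
    "UTBETALNING",
    "",
    "Fakturadatum Förfallodatum Ref nr Fakturanummer",
    "",
    "2019-09-05 2019-09-12 190905CS",
    "",
    "Betalningsmottagare",
])


def value_below_fakturanummer(text):
    """Return the last space-free token on the line 1 or 2 line breaks below Fakturanummer."""
    # Step 1: capture the whole line that follows the header word.
    line = re.search(r"Fakturanummer(?:\r?\n){1,2}(\S[^\r\n]*)", text)
    if line is None:
        return None
    # Step 2: keep only the final run of non-whitespace characters on that line.
    last = re.search(r"\S+$", line.group(1))
    return last.group(0) if last else None


fakturanummer = value_below_fakturanummer(SAMPLE)
```

This relies on the same four assumptions listed above, in particular that the wanted value is the last item on its line and contains no spaces.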

supermanPunch · September 12, 2019, 5:01pm

@Dave Shouldn’t there be a group for \d+ in your regex (?<=Fakturanr )\d+ ?

Dave · September 12, 2019, 5:06pm

@supermanPunch go ahead and give all of my solutions a try. Every one is working using the assumptions I have provided and the input text provided by OP. I’d recommend .NET Regex Tester - Regex Storm personally, but any regex tester is fine as long as it is using VB.NET regex.

There is no need to group \d+, but you certainly can if you want.

Since OP just wants to pull out a single value, I would simply assign:
MyValueAsString = Regex.Match(InputString,"(?<=Fakturanr )\d+").Value
This would give your string variable named MyValueAsString the value you need.
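
For comparison, roughly the same assignment in Python's `re` module (a sketch, not the UiPath/VB.NET code itself; the variable names are illustrative). The lookbehind here has a fixed width, so Python accepts it unchanged, and no group is needed because the overall match is already just the digits.

```python
import re

# The FAKTURA sample from the post; plain \n line endings are an assumption.
SAMPLE = "FAKTURA\nFakturanr 3732 Kundnr 288\n\nTEST ABC\nFakturadatum 2019-08-29\n"

# The lookbehind requires "Fakturanr " right before the digits but leaves it out of the match.
match = re.search(r"(?<=Fakturanr )\d+", SAMPLE)

# Fall back to an empty string when nothing matches (a choice made for this sketch).
my_value_as_string = match.group(0) if match else ""
```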

P_S · September 12, 2019, 6:25pm

Thanks @Dave! I have one more for you if you’re up for it.
Or… anyone else.

Invoice
Invoice number: 10367140
Invoice date: 2 September 2019

Here I want to get the number after “Invoice number”, that is “10367140”.
I tried with the following
"(?<=(Fakturanummer|Fakturanr|Invoice number:) )\d+"
But can’t get it to work.

Dave · September 12, 2019, 8:21pm

Sure thing! Here is the regex statement that would work: (?<=Invoice number: )\d+

(?<=) - this is called a positive lookbehind. Whatever we put inside it must appear in the input string immediately before the match, but it is not part of the match itself. Since we had (?<=Invoice number: ), the regex looks for the words “Invoice number” followed by a colon and a space, and then captures what comes directly to the right.
\d+ - this means we want 1 or more digits. Anything other than a digit ends the match, so for 10357-3834 it would only match 10357.
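
A quick way to check both points is a small Python sketch (Python's `re`, not .NET; the helper name `invoice_number` is made up for illustration).

```python
import re

# Positive lookbehind: "Invoice number: " must come right before the digits,
# but only the digits are returned as the match.
PATTERN = re.compile(r"(?<=Invoice number: )\d+")


def invoice_number(text):
    """Return the digits that directly follow 'Invoice number: ', or None."""
    match = PATTERN.search(text)
    return match.group(0) if match else None


full = invoice_number("Invoice number: 10367140")
partial = invoice_number("Invoice number: 10357-3834")   # stops at the hyphen
missing = invoice_number("Invoice date: 2 September 2019")
```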

P_S · September 12, 2019, 8:27pm

I just realized, that there must be some character in this original text that wasn’t included in my post.
So please look at the following file: https://ufile.io/r70jru0b
With this content, it still doesn’t work.

Note: I got it to work with the following (matching other words as well that I need):

(?<=(Fakturanummer|Fakturanr|Invoice number).*)\d+

Dave · September 12, 2019, 8:37pm

It appears to be the whitespace character. The one in your textfile is different than the one you get pressing the spacebar. This can be compensated by using \s* so the expression would be: (?<=Invoice number:\s*)\d+
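
To illustrate the effect, here is a sketch in Python that assumes the odd character is a non-breaking space (U+00A0). That is only a guess at what the file contains. Python's `re` requires fixed-width lookbehinds, so a capturing group is used here in place of `(?<=Invoice number:\s*)`.

```python
import re

# Hypothetical line where the gap after the colon is a non-breaking space.
NBSP_LINE = "Invoice number:\u00a010367140"
SPACE_LINE = "Invoice number: 10367140"

# A literal space in the pattern does not match the non-breaking space.
literal_space = re.search(r"Invoice number: (\d+)", NBSP_LINE)

# \s* accepts any whitespace, including the non-breaking space, or none at all.
flexible_nbsp = re.search(r"Invoice number:\s*(\d+)", NBSP_LINE)
flexible_space = re.search(r"Invoice number:\s*(\d+)", SPACE_LINE)
```

Here `literal_space` is `None`, while both `flexible_*` matches carry the invoice number in group 1.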

However, the solution you have is working as well so use either one
