Help with RegExp

Hi all,

I have read in the text from a PDF and a part of the content looks like this:

Belopp 
Konto Ansvar Projekt Verks Produkt Objekt Moms exkl moms

6419 94205 1 40700 4220 209,30 837,20

209,30
1677 SUMMA MOMS

1.046,50
2510 ATT UTBETALA
CAN BE MORE TEXT/NUMBERS HERE

I would want to extract the value “209,30” before “SUMMA MOMS”
and also the value “1.046,50” before “ATT UTBETALA”.

All the numbers above can be different each time.

How can I get this done in the easiest way?

@P_S Can you give us more similar data :sweat_smile:

All the numbers in it can be different, but all the text is always the same.
Here’s another example:

Belopp 
Konto Ansvar Projekt Verks Produkt Objekt Moms exkl moms

5439 54205 4 43210 5210 310,45 1957,30

310,45
2677 SUMMA MOMS

22.356,89
3520 ATT UTBETALA

CAN BE MORE TEXT/NUMBERS HERE

So here I would want to have the number on the row above “SUMMA MOMS”, which is “310,45”
and the number on the row above “ATT UTBETALA” = “22.356,89”

@P_S Ohh if SUMMA MOMS and ATT UTBETALA are constant then we can retrieve those values :slightly_smiling_face:

@p_s This regex should find the number (digits, including commas and periods) one line above the words SUMMA MOMS:
[0-9,.]+(?=\r\n.*SUMMA MOMS)

And this regex will do the same, but looks for the line above the word ATT UTBETALA
[0-9,.]+(?=\r\n.*ATT UTBETALA)
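
For anyone who wants to try these two patterns outside UiPath, here is a minimal Python sketch. It is only an illustration, not part of the original answer: the helper name `value_above` is made up, the sample text is the first example from this thread with `\r\n` line endings assumed, and the pattern uses `\r?\n` so that a bare `\n` is accepted too. Python's `re` module treats this lookahead the same way .NET does, but it is still worth testing in a .NET tester before using it in a workflow.

```python
import re
from typing import Optional

# Sample text from the first example in the thread (line endings assumed to be \r\n).
text = (
    "6419 94205 1 40700 4220 209,30 837,20\r\n"
    "\r\n"
    "209,30\r\n"
    "1677 SUMMA MOMS\r\n"
    "\r\n"
    "1.046,50\r\n"
    "2510 ATT UTBETALA\r\n"
)


def value_above(label: str, source: str) -> Optional[str]:
    """Return the number that ends the line directly above the line containing `label`."""
    # [0-9,.]+ : a run of digits, commas and periods
    # (?=...)  : lookahead, the next line must contain the label but is not consumed
    pattern = r"[0-9,.]+(?=\r?\n.*" + re.escape(label) + ")"
    match = re.search(pattern, source)
    return match.group(0) if match else None


summa_moms = value_above("SUMMA MOMS", text)
att_utbetala = value_above("ATT UTBETALA", text)
print(summa_moms, att_utbetala)
```

The label is passed through `re.escape` so that any special characters in it are treated literally.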

You can use this regex (\d+,\d+)(?:\n\d+ SUMMA MOMS) to get the first number, and this one (\d+,\d+)(?:\n\d+ ATT UTBETALA) for the second number.

You can use this editor to test the regex before using it in UiPath: https://regex101.com

@P_S I have a somewhat more complicated regex :sweat_smile: But I feel this works too, if you just want to try it.
Regex for SUMMA MOMS:
((\d{4,})|((\d+(,|\.))+\d{2,}))\s*(?:\n\d+\sSUMMA\sMOMS)

Regex for ATT UTBETALA:
((\d{4,})|((\d+(,|\.))+\d{2,}))\s*(?:\n\d+\sATT\sUTBETALA)

You can use the Matches activity with this regex and get the 3rd group. For further help, just ask :slightly_smiling_face:

I believe @Amine_Tazlaft’s pattern needs a small change to give just the number in VB.NET: (?: ) is a non-capturing group, not a lookahead, so the full match also includes the following line. Changing ?: to ?= turns it into a positive lookahead, and then the match is only the number. However, the (\d+,\d+) portion does not work if the number has no comma, and it cuts the number short if it includes a period. The solution [0-9,.]+ handles all of these number formats without issue, although it will also grab invalid numbers that contain more than one decimal separator.

Also, for the lookahead you’ll need to include the carriage return \r in addition to the newline \n. For VB.NET-specific regex testing I highly recommend Regex Storm (.NET Regex Tester) instead of regex101.com.
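
To make the difference between a non-capturing group and a lookahead concrete, here is a small Python sketch (an illustration only; the variable names are made up, and `\r?\n` is used instead of `\n` so the `\r\n` line endings assumed here are handled). Python and .NET behave the same way for these constructs. It also shows what happens to `\d+,\d+` when the number contains a period.

```python
import re

text = "209,30\r\n1677 SUMMA MOMS\r\n"

# Non-capturing group: the following line is consumed, so it is part of the full match.
consumed = re.search(r"(\d+,\d+)(?:\r?\n\d+ SUMMA MOMS)", text)
# Positive lookahead: the following line is only checked, not consumed.
peeked = re.search(r"(\d+,\d+)(?=\r?\n\d+ SUMMA MOMS)", text)

full_with_group = consumed.group(0)    # the number plus the line after it
number_from_group = consumed.group(1)  # just the number, via capture group 1
full_with_lookahead = peeked.group(0)  # just the number

# A number with a thousands separator: \d+,\d+ cannot match across the period.
text2 = "1.046,50\r\n2510 ATT UTBETALA\r\n"
truncated = re.search(r"\d+,\d+(?=\r?\n\d+ ATT UTBETALA)", text2).group(0)
complete = re.search(r"[0-9,.]+(?=\r?\n\d+ ATT UTBETALA)", text2).group(0)

print(repr(full_with_group), number_from_group, full_with_lookahead)
print(truncated, complete)
```

So the non-capturing version is usable if you read group 1 instead of the whole match; the lookahead version lets you read the match value directly.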

@supermanPunch this solution is working (although you also have to change ?: to ?= to get a positive lookahead), but it seems to add complexity where it isn’t needed, with the multiple groupings and alternations. Is there any advantage to it over using [0-9,.]+ that I am missing?

@Dave I believe in your case this should have been the regex
([0-9,.]+)\n([0-9,.]+)\s*(ATT\sUTBETALA)

Thanks @Dave, this seems to work fine!

Can you also give me advice on the following:

FAKTURA
Fakturanr 3732 Kundnr 288

TEST ABC
Fakturadatum 2019-08-29 

Here I want to have the value after “Fakturanr”, that is only the value “3732”.

And also the following case:

UTBETALNING

Fakturadatum Förfallodatum Ref nr Fakturanummer

2019-09-05 2019-09-12 190905CS

Betalningsmottagare 

Where I actually want the value of “Fakturanummer”… however, that can be found on the row below and has the value “190905CS”.

Many thanks in advance!

@P_S Can you provide similar data :sweat_smile: More samples are always needed to confirm whether the regex works properly.

Sure thing!

The first one is simpler: it searches for the word “Fakturanr” followed by a single blank space, then grabs all of the digits that come right after it. The regex to do this is: (?<=Fakturanr )\d+
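
A quick way to check this one outside UiPath is the Python sketch below (illustrative only; the variable names are made up and the line endings in the sample are assumed). The lookbehind here has a fixed width, so Python's `re` module accepts the pattern unchanged.

```python
import re

# Sample text from the post above (line endings assumed to be \r\n).
text = (
    "FAKTURA\r\n"
    "Fakturanr 3732 Kundnr 288\r\n"
    "\r\n"
    "TEST ABC\r\n"
    "Fakturadatum 2019-08-29\r\n"
)

pattern = r"(?<=Fakturanr )\d+"

match = re.search(pattern, text)
invoice_number = match.group(0) if match else None

# Only digits directly preceded by "Fakturanr " qualify, so "288" is not matched.
all_matches = re.findall(pattern, text)
print(invoice_number, all_matches)
```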

The next one is a little more complicated. I will make 4 assumptions as follows, please let me know if any assumptions could be incorrect:

  1. The value you want will always be 1 or 2 lines below “Fakturanummer”.
  2. “Fakturanummer” is always at the end of the line.
  3. The value you want is always at the end of the line.
  4. The value you want does not contain blank spaces anywhere within it.

Based on those 4 assumptions, you could do it in a single regex, but I would instead break it into 2 regex statements. The first pulls the entire line containing the value that you want, and the second grabs just the value.

Regex 1 (to pull out the entire line):
(?<=Fakturanummer(\r\n){1,2})\S.+

Regex 2 (to pull out just the last portion of that line):
\S+$
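
The same two-step idea can be sketched in Python, with one swap that should be stated plainly: Python's `re` module does not allow a variable-length lookbehind such as `(?<=Fakturanummer(\r\n){1,2})`, so step 1 below uses a capture group instead. The function name is made up, the sample's `\r\n` line endings are an assumption, and the four assumptions listed above still apply.

```python
import re
from typing import Optional

# Sample text from the question (line endings assumed to be \r\n).
text = (
    "UTBETALNING\r\n"
    "\r\n"
    "Fakturadatum Förfallodatum Ref nr Fakturanummer\r\n"
    "\r\n"
    "2019-09-05 2019-09-12 190905CS\r\n"
    "\r\n"
    "Betalningsmottagare \r\n"
)


def value_below_fakturanummer(source: str) -> Optional[str]:
    # Step 1: capture the line that starts 1 or 2 line breaks after "Fakturanummer".
    # [^\r\n]* stops before the line break, so no trailing \r ends up in the line.
    line_match = re.search(r"Fakturanummer(?:\r?\n){1,2}(\S[^\r\n]*)", source)
    if line_match is None:
        return None
    line = line_match.group(1)
    # Step 2: take the last run of non-whitespace characters on that line.
    value_match = re.search(r"\S+$", line)
    return value_match.group(0) if value_match else None


fakturanummer = value_below_fakturanummer(text)
print(fakturanummer)
```

Keeping the two steps separate makes it easy to inspect the intermediate line when the PDF text does not look the way you expect.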

@Dave Shouldn’t there be a group for \d+ in your regex (?<=Fakturanr )\d+ :open_mouth:

@supermanPunch go ahead and give all of my solutions a try. Every one is working using the assumptions I have provided and the input text provided by OP. I’d recommend .NET Regex Tester - Regex Storm personally, but any regex tester is fine as long as it is using VB.NET regex.

There is no need to group \d+, but you certainly can if you want.

Since OP just wants to pull out a single value, I would simply assign:

MyValueAsString = Regex.Match(InputString,"(?<=Fakturanr )\d+").Value

This gives your string variable named MyValueAsString the value you need.

Thanks @Dave! I have one more for you if you’re up for it :slight_smile:
Or… anyone else.

Invoice
Invoice number: 10367140
Invoice date: 2 September 2019  

Here I want to get the number after “Invoice number”, that is “10367140”.
I tried with the following
"(?<=(Fakturanummer|Fakturanr|Invoice number:) )\d+"
But can’t get it to work.

Sure thing! Here is the regex statement that would work: (?<=Invoice number: )\d+

(?<=) - this is called a positive lookbehind. Whatever we put inside it must be found in the input string, and the portion to the right of the lookbehind is our match. Since we have (?<=Invoice number: ), it looks for the words “Invoice number” followed by a colon and a space, then captures what is to the right.
\d+ - this means we want 1 or more digits. Anything other than a digit ends the match, so for 10357-3834 it would only match 10357.
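
Both points can be checked with a short Python sketch (illustrative only; the variable names are made up). The lookbehind is fixed-width, so the pattern is the same in Python as in .NET.

```python
import re

pattern = r"(?<=Invoice number: )\d+"

# The lookbehind text must be present, but it is not part of the match itself.
match = re.search(pattern, "Invoice number: 10367140")
matched_text = match.group(0)
match_start = match.start()  # index where the digits begin, right after the label

# \d+ stops at the first non-digit, and "3834" is not preceded by the label.
partial = re.findall(pattern, "Invoice number: 10357-3834")

print(matched_text, match_start, partial)
```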

I just realized that there must be some character in the original text that wasn’t included in my post.
So please look at the following file: https://ufile.io/r70jru0b
With this content, it still doesn’t work.

Note: I got it to work with the following (matching other words as well that I need):

(?<=(Fakturanummer|Fakturanr|Invoice number).*)\d+

It appears to be the whitespace character. The one in your text file is different from the one you get by pressing the spacebar. You can compensate for this by using \s*, which matches any run of whitespace characters, so the expression would be: (?<=Invoice number:\s*)\d+

However, the solution you have is working as well so use either one :slight_smile:
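
Here is a minimal Python sketch of the whitespace problem. Two things in it are assumptions: the thread does not say which character is in the file, so a non-breaking space (U+00A0) is used purely as an example, and because Python's `re` module rejects a variable-length lookbehind like `(?<=Invoice number:\s*)`, the sketch uses a capture group instead.

```python
import re

# ASSUMPTION: the unusual whitespace is a non-breaking space (U+00A0).
text = "Invoice number:\u00a010367140"

# A literal space in the pattern does not match the non-breaking space.
literal_space = re.search(r"Invoice number: (\d+)", text)

# \s matches Unicode whitespace, including U+00A0, so this one succeeds.
any_whitespace = re.search(r"Invoice number:\s*(\d+)", text)
invoice_number = any_whitespace.group(1) if any_whitespace else None

print(literal_space, invoice_number)
```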
