Need help with idea for string manipulation

Hello All!

I'm looking for help with a problem. I'm reading a whole PDF with OCR and need to extract a couple of fields. The problem is with cash amounts: when the amount is under one thousand everything is fine, but above that it breaks, because I'm splitting the whole string by spaces and extracting data by index. A bigger amount such as 1 250,50 gets split into index(0) = 1 and index(1) = 250,50, which breaks the whole approach.

Example of what the extracted OCR string looks like:

  1. installment number 2 for invoice numer AAA/BBB CC DDD (1 050,50) (23) (241,60) (1 292,42)

I added the brackets only to show that I always have 4 numeric fields to extract, plus the rest of the data such as the invoice number. The order is always the same, but the amounts are not. Splitting by spaces worked fine at first: the last position is the non-tax amount, so I just took the last element and assigned it as non-tax. But when an amount is bigger than one thousand, the split cuts the amount in two and all the indexes shift.

Give Regex a try as an addition to your extraction strategy:

\b(\d+,?\ ?)+\b
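
For anyone who wants to try the pattern outside UiPath, here is a minimal Python sketch. Python's `re` module is only a stand-in for .NET Regex, and the variable names are made up for illustration. It runs the pattern above on the bracketed sample line from the first post. The pattern also picks up the installment number, so the sketch keeps only the last four matches, assuming the four values always come last in the line.

```python
import re

# Pattern from the post above, unchanged.
PATTERN = r"\b(\d+,?\ ?)+\b"

# Bracketed sample line from the first post (brackets were added by hand there).
line = ("installment number 2 for invoice numer AAA/BBB CC DDD "
        "(1 050,50) (23) (241,60) (1 292,42)")

# group(0) is the whole match; strip() drops a trailing space the pattern may keep.
found = [m.group(0).strip() for m in re.finditer(PATTERN, line)]

# Assumption: the four wanted values are always the last four matches.
amounts = found[-4:]
print(amounts)
```

The brackets are what keep the matches apart here; they stop each run of digits, commas and spaces from merging with the next one.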

Thank you, but I now have the data extracted as below:

So I have one string and I need 4 separate values:

1 021,01
1 255,84

The only fixed number in the invoices is 23; the rest of the values are random and unpredictable. Is there any chance to extract all 4 values into variables like Val1, Val2…?

Yes, we would do it with regex and then, e.g., return an array with the found values. As the occurrence is also not fixed, we would not recommend an approach like var1, var2…

Getting an array with all regex matches would look like this:

Assign activity:
left: arrAllMatches | Datatype: String() - a string array

Regex.Matches(yourStringVar, "\b(\d+,?\ ?)+\b").Cast(Of Match).Select(function (x) x.toString).ToArray()

ensure following:

So my string output is like:

1 021,01 23 234,83 1 255,84
179,88 23 41,37 221,25

I just don't get how to extract the values when there are thousands, because there is a blank space inside the amount, as in "1 021,01". I don't think separating by comma would work, because there is also the value “23”, which has no comma. I'm totally stuck here.

In your first post you mentioned that the input you get is in the following format:

installment number 2 for invoice numer AAA/BBB CC DDD (1 050,50) (23) (241,60) (1 292,42)

In that case, a RegEx like the one @ppr has mentioned, or something like the below, would work.

If the string you need to run the RegEx on is as you suggest in your later posts,

1 021,01 23 234,83 1 255,84
179,88 23 41,37 221,25

Then it suddenly gets tricky. Do you expect there to ever be a case where the number consists of more than 4 digits in the first part? Or is 9 999,99 the absolute max?
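
A short Python sketch of why this is tricky. Python's `re` again stands in for .NET Regex, and the variable names are illustrative only. Without brackets, the pattern suggested earlier in the thread treats digits, commas and spaces as one continuous run, and splitting on spaces yields a different number of tokens depending on whether thousands are present.

```python
import re

# Pattern suggested earlier in the thread, unchanged.
PATTERN = r"\b(\d+,?\ ?)+\b"

# The two unbracketed sample rows from the later posts.
with_thousands = "1 021,01 23 234,83 1 255,84"
without_thousands = "179,88 23 41,37 221,25"

# Nothing separates one value from the next, so the whole row is one match.
matches = [m.group(0) for m in re.finditer(PATTERN, with_thousands)]
print(len(matches))

# Splitting on spaces gives six tokens for one row and four for the other,
# which is why fixed indexes stop lining up.
print(with_thousands.split())
print(without_thousands.split())
```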

Indeed it's tricky, and yes, I expect that amounts above 4 digits can occur anywhere. The amount changes per invoice; it may be 3 digits (hundreds), but I do not exclude the possibility of thousands, with 4 or more digits.

In that case, it all comes down to what the string you will be matching against looks like.
If it looks like the example below, I don’t think you will succeed:

1 021,01 23 234,83 1 255,84
179,88 23 41,37 221,25

If you are able to get some more information/data/characters in the string you will be matching against, then maybe your chances would increase.
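
One piece of extra information that might already be in the samples is their shape: every amount shown has exactly two decimals after the comma, and the second value (23) has no comma at all. Here is a minimal Python sketch under those assumptions; `re` stands in for .NET Regex, and `AMOUNT`, `ROW`, `to_decimal` and `parse_row` are hypothetical names. It assumes each row ends with amount, whole-number rate, amount, amount, and that thousands are grouped as a space followed by exactly three digits. Digits sitting directly in front of the first amount (for example an invoice number ending in a number) would still be ambiguous.

```python
import re
from decimal import Decimal

# Assumption: an amount is 1-3 digits, optional " ddd" groups, a comma, 2 decimals.
AMOUNT = r"\d{1,3}(?: \d{3})*,\d{2}"

# Assumption: the row ends with amount, 1-2 digit rate, amount, amount.
ROW = re.compile(rf"({AMOUNT}) (\d{{1,2}}) ({AMOUNT}) ({AMOUNT})$")


def to_decimal(text):
    """Turn '1 021,01' into Decimal('1021.01')."""
    return Decimal(text.replace(" ", "").replace(",", "."))


def parse_row(text):
    """Return the four values as a list, or None if the row has another shape."""
    match = ROW.search(text.strip())
    if match is None:
        return None
    first, rate, second, third = match.groups()
    return [to_decimal(first), int(rate), to_decimal(second), to_decimal(third)]


for row in ("1 021,01 23 234,83 1 255,84", "179,88 23 41,37 221,25"):
    print(parse_row(row))
```

Because the comma-free value has to sit between the first and second amount, the engine cannot read "23 234,83" as a single number here. In both sample rows the first and third values also add up to the last one, which could serve as an extra sanity check if that relation really holds for every invoice.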