Regex - How to do a positive lookahead, where the output should be (n) words behind the lookahead?


#1

I am receiving a system-generated PDF and need to pull 2 different numbers out of a row, but only when the report says the row is not an error. I think the best way is likely to use 2 regular expressions: one to pull out number1 and another to pull out number2.

I am struggling to find a regex solution that does the following:

Input text:
6056810 P 11282017 11272017 00000000 $ 666.26 POLICY MASTER SUPPLEMENT (AP012) MIN-DIST-INDICATOR DOES NOT = 'Y' REFERENCE DATE: 11/28/2017
7061977 P 11282017 11272017 00000000 $ 2,265.09
CV1 AMOUNT TO LOW TO PROCESS MINIMUM DISRTIBUTION
314283 P 11282017 11272017 00000000 $ 3,003.40 REFERENCE DATE: 11/28/2017

Desired output:
Match(0) = 7061977
Match(1) = 314283

The words “CV1 Amount…” or “Reference Date” will always follow the amount, so I am hoping to use that as the anchor and look back the correct number of words to pull out only the numbers I’m looking for. I was able to pull out the $ amount by simply using a positive lookahead for that exact text. However, I’m not sure if it’s possible to look back (n) words. I want to grab the word that is 7 words prior to the lookahead. Is that possible?


#2

Hey.

My initial thinking was that you have 7 fields, so you could extract the 1st and 7th of those fields. The first step would be to replace the separators between fields with a single consistent delimiter and get rid of the extra space. Then you could just use .Split() to change the text to an array.

Another approach would be to extract the number that’s in the right format with decimals, since only one value has a decimal.
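
Both ideas can be sketched in a few lines. The example below is Python rather than a UiPath workflow, and it assumes a row always has exactly 7 whitespace-separated fields with the amount last; the variable names are made up for illustration.

```python
import re

# One sample row from post #1, with an extra space added before the
# amount to show that split() tolerates repeated separators.
row = "7061977 P 11282017 11272017 00000000 $  2,265.09"

# Approach 1: split on runs of whitespace and take the 1st and 7th fields.
fields = row.split()
number = fields[0]
amount = fields[6]

# Approach 2: search for the only value formatted with a decimal point.
decimal_amount = re.search(r"\d[\d,]*\.\d{2}", row).group(0)

print(number, amount, decimal_amount)
```

Neither approach checks what follows the amount, so rows like the “POLICY” one would still need to be filtered out separately.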

(sorry I was responding an hour ago and got sidetracked)


#3

I’d still be interested in how to do the regex lookahead as mentioned above, but I did end up going a different route due to time constraints. One note is that the words “CV1 Amount” or “Reference Date” have to immediately follow the $X.XX number; I don’t want to pick up the first row of my input text, because it has the word “Policy” immediately after the amount instead. If it’s possible to look back (n) words behind your lookahead, then it would be easy. I’m just not sure how to do that or if it’s possible.

Instead, I just iterated through the string a few times and removed strings based on different criteria each time. Much messier, but it did the job quicker in the end, since it only had to go through ~10k rows 3 times to get the 2 values.


#4

Yeah, I’m not sure how you would find the nth word in regex. I’m not an expert, but I would be interested to learn what pattern to use.

Normally you can use {n}; for example, “[A-Z]{7}” would match 7 uppercase letters. For words you would need something like ((.*)\s), and I don’t know if {n} would work next to it.
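
For reference, a quantifier such as {n} can follow a group as well as a character class, so a "word plus whitespace" group can be repeated a fixed number of times inside a lookahead. Here is a minimal Python sketch against the sample rows from post #1. It assumes the wanted number is the first word on its line, that exactly 6 more words sit between it and the anchor text, and that the anchor is spelled "CV1 AMOUNT" or "REFERENCE DATE". The same constructs exist in .NET regex, but this was not run in UiPath.

```python
import re

# Sample rows copied from post #1.
text = (
    "6056810 P 11282017 11272017 00000000 $ 666.26 POLICY MASTER SUPPLEMENT (AP012) "
    "MIN-DIST-INDICATOR DOES NOT = 'Y' REFERENCE DATE: 11/28/2017\n"
    "7061977 P 11282017 11272017 00000000 $ 2,265.09\n"
    "CV1 AMOUNT TO LOW TO PROCESS MINIMUM DISRTIBUTION\n"
    "314283 P 11282017 11272017 00000000 $ 3,003.40 REFERENCE DATE: 11/28/2017\n"
)

# number1: the first word on a line, but only if exactly 6 more words
# and then the anchor text follow it.
number_pattern = re.compile(
    r"^(\d+)"                            # first word on the line
    r"(?="                               # lookahead: consumes nothing
    r"(?:\s+\S+){6}"                     # exactly 6 more words
    r"\s+(?:CV1 AMOUNT|REFERENCE DATE)"  # anchor must come right after them
    r")",
    re.MULTILINE,
)

# number2: skip 6 words, capture the 7th, then require the anchor.
amount_pattern = re.compile(
    r"^(?:\S+\s+){6}(\S+)(?=\s+(?:CV1 AMOUNT|REFERENCE DATE))",
    re.MULTILINE,
)

numbers = number_pattern.findall(text)
amounts = amount_pattern.findall(text)
print(numbers)
print(amounts)
```

The first row is skipped because the word right after its amount is POLICY, not one of the anchors. Since \s+ also matches a line break, the CV1 text on the following line still counts as coming right after the amount.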


#5

You might be able to use something like this to isolate the LINE above “CV1”:
(?<=\n)([\S\s]*)(?=CV1)

or possibly something like this to isolate the LINE containing the digit-space-REFERENCE DATE (using the “digit” here so it doesn’t get confused with the ‘Y’ REFERENCE DATE line):
(.*?)(?=\d REFERENCE DATE)

I didn’t test this against your data, but I recently ran into a similar issue with a multiline text document, which I describe here; it explains how I think UiPath’s treatment of (.*?) and ([\S\s]*) might be used to your advantage.

Once you have the line isolated, you could split by spaces to get to the desired output.
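
A rough Python sketch of that "isolate the line, then split" idea, using the sample rows from post #1. It is not the exact patterns above. A greedy ([\S\s]*) crosses line breaks, so on a long report it can capture far more than one line; the sketch uses (.*), which stays on a single line in Python, and puts the line break inside the lookahead. It assumes the amount always ends in a digit, and the names are made up for illustration.

```python
import re

# Sample rows copied from post #1.
text = (
    "6056810 P 11282017 11272017 00000000 $ 666.26 POLICY MASTER SUPPLEMENT (AP012) "
    "MIN-DIST-INDICATOR DOES NOT = 'Y' REFERENCE DATE: 11/28/2017\n"
    "7061977 P 11282017 11272017 00000000 $ 2,265.09\n"
    "CV1 AMOUNT TO LOW TO PROCESS MINIMUM DISRTIBUTION\n"
    "314283 P 11282017 11272017 00000000 $ 3,003.40 REFERENCE DATE: 11/28/2017\n"
)

# Keep the part of a line that ends in a digit and is followed either by
# " REFERENCE DATE" on the same line or by a new line starting with CV1.
line_pattern = re.compile(
    r"^(.*\d)(?=[ \t]+REFERENCE DATE|\r?\nCV1)",
    re.MULTILINE,
)

isolated = [m.group(1) for m in line_pattern.finditer(text)]

# Split each isolated line on whitespace: first field is the number,
# last field is the amount.
numbers = [line.split()[0] for line in isolated]
amounts = [line.split()[-1] for line in isolated]

print(numbers)
print(amounts)
```

Requiring a digit right before REFERENCE DATE is what keeps the first row out, since there the anchor follows 'Y' rather than the amount.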