Multiline Regex.Replace contents between two key words within text document


#1

I banged my head on the wall for a few hours today trying to figure out why my Regex Pattern was working when using a lookahead and lookbehind on a set of keywords for selecting text in between the keywords on the same line, but it continued to fail within a multiple-line text document.

The example below is meant to select all text between " person—" and the words "Not over" in my document (these keywords appear on different lines).

Regex.Replace(readPDFText,"(?<=" & " person—" & ")(.*?)(?=" & "Not over" & ")",Environment.NewLine).ToString

I finally figured out that the period in (.*?) matches every character except "new line" (\n) in the .NET regex engine that VB.NET uses. So if you find yourself facing the same problem,
changing:
(.*?)
to
([\s\S]*?)
did the trick.

Regex.Replace(readPDFText,"(?<=" & " person—" & ")([\s\S]*?)(?=" & "Not over" & ")",Environment.NewLine).ToString
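
For anyone who wants to see the difference in isolation, here is a minimal sketch written as a VB.NET console program. The sample string and the module and variable names are invented for illustration and are not part of the original workflow; only the two patterns come from the post above.

```vb
Imports System
Imports System.Text.RegularExpressions

Module DotVersusCharClassDemo
    Sub Main()
        ' Hypothetical sample: the two keywords sit on different lines.
        Dim sampleText As String = "(b) MARRIED person—" & vbLf &
                                   "(after subtracting" & vbLf &
                                   "withholding allowances) is:" & vbLf &
                                   "Not over Over But not over"

        ' "." stops at line breaks by default, so this pattern cannot reach "Not over".
        Dim dotPattern As String = "(?<= person—)(.*?)(?=Not over)"
        ' [\s\S] matches whitespace or non-whitespace, i.e. any character at all.
        Dim classPattern As String = "(?<= person—)([\s\S]*?)(?=Not over)"

        Console.WriteLine(Regex.IsMatch(sampleText, dotPattern))   ' False
        Console.WriteLine(Regex.IsMatch(sampleText, classPattern)) ' True

        ' Replace everything between the keywords with a single line break.
        Console.WriteLine(Regex.Replace(sampleText, classPattern, Environment.NewLine))
    End Sub
End Module
```

The last line prints the sample with only the two keyword lines left, which is the same effect the Regex.Replace expression above has on readPDFText.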

Hope this helps someone!


#2

@mr_steve Can you please provide sample input and the expected output you are trying to achieve?


#3

Hi @indra,

Actually my original post was not a question but a solution: I just posted it in case someone else ran into the same issue with Regex patterns that contain line breaks (in this case I was replacing text between a lookahead/lookbehind).

The code that works is:
Regex.Replace(readPDFText,"(?<=" & " person—" & ")([\s\S]*?)(?=" & "Not over" & ")",Environment.NewLine).ToString

(shown above with ampersands for clarity's sake, but this can be reduced to:)

Regex.Replace(readPDFText,"(?<=person—)([\s\S]*?)(?=Not over)",Environment.NewLine)

In case you’re still interested, the original Text looked like this:

TABLE 1—WEEKLY Payroll Period
(a) SINGLE person (including head of household) (b) MARRIED person—

(after subtracting
withholding allowances) is:
The amount of income tax
to withhold is:

(after subtracting
withholding allowances) is:
The amount of income tax
to withhold is:
Not over Over But not over Bracket Base Percentage of excess over Over But not over Bracket Base Percentage of excess over
$71 $0.00 0.00% $0 $0 $222 $0.00 0.00%
$71 $254 $0.00 10% $71 $222 $588 $0.00 10% $222
$254 $815 $18.30 12% $254 $588 $1,711 $36.60 12% $588
$815 $1,658 $85.62 22% $815 $1,711 $3,395 $171.36 22% $1,711
$1,658 $3,100 $271.08 24% $1,658 $3,395 $6,280 $541.84 24% $3,395
$3,100 $3,917 $617.16 32% $3,100 $6,280 $7,914 $1,234.24 32% $6,280
$3,917 $9,687 $878.60 35% $3,917 $7,914 $11,761 $1,757.12 35% $7,914
$9,687 $2,898.10 37% $9,687 $11,761 $3,103.57 37% $11,761

It’s tab-delimited data showing US Federal Income Tax brackets for weekly paychecks. The goal was to remove the text between “MARRIED person—” and the line starting “Not over Over But not over”. The text I wished to delete repeats multiple times throughout the entire document (readPDFText), always between these same keywords.
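
Because the block repeats, the lazy quantifier (*?) matters: Regex.Replace handles every match, and each match should stop at the nearest "Not over" rather than the last one in the document. A rough sketch of that behavior follows; the two-table sample text and all names are hypothetical stand-ins, not the real readPDFText.

```vb
Imports System
Imports System.Text.RegularExpressions

Module LazyQuantifierDemo
    Sub Main()
        ' Hypothetical two-table sample; the real document repeats the block many times.
        Dim sampleText As String = "MARRIED person—" & vbLf & "header A" & vbLf & "Not over 1" & vbLf &
                                   "MARRIED person—" & vbLf & "header B" & vbLf & "Not over 2"

        ' Lazy: each match stops at the nearest "Not over", so both blocks are found.
        Dim lazyCount As Integer = Regex.Matches(sampleText, "(?<=person—)[\s\S]*?(?=Not over)").Count
        ' Greedy: a single match runs from the first "person—" to the last "Not over".
        Dim greedyCount As Integer = Regex.Matches(sampleText, "(?<=person—)[\s\S]*(?=Not over)").Count

        Console.WriteLine(lazyCount)   ' 2
        Console.WriteLine(greedyCount) ' 1
    End Sub
End Module
```

With the greedy form, a replace would also wipe out the first table's rows, so the lazy form is the one to keep.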

What was originally messing me up were the line breaks: UiPath’s Replace activity wasn’t working because of them, which led me to use Regex.Replace.

To confuse matters further, I had assumed that the special character “period” (“.”) meant “all characters including line breaks”, as it does in some regex setups. In .NET (and in JavaScript too, unless the s/dotAll flag is set) the period means “all characters except line breaks” by default, so (.*?) (see original post) kept failing. (.\n*?) did not work either: it matches a single non-line-break character followed by optional line breaks, so it still cannot span several lines of text.

Instead, the answer I found is that you can use ([\S\s]*?): \S matches any non-whitespace character and \s matches any whitespace character (line breaks included), so together they match every character.
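
As a hedged aside, .NET also offers RegexOptions.Singleline (inline form (?s)), which makes the period match every character including \n. This is an alternative to the character class, not what the expression above uses. The sketch below uses an invented sample string and hypothetical names to show that the original (.*?) pattern works once the option is set.

```vb
Imports System
Imports System.Text.RegularExpressions

Module SinglelineOptionDemo
    Sub Main()
        ' Hypothetical sample with the keywords on different lines.
        Dim sampleText As String = "MARRIED person—" & vbLf &
                                   "to withhold is:" & vbLf &
                                   "Not over Over"

        Dim pattern As String = "(?<=person—)(.*?)(?=Not over)"

        ' Without the option nothing matches, so the text comes back unchanged.
        Dim unchanged As String = Regex.Replace(sampleText, pattern, Environment.NewLine)
        Console.WriteLine(unchanged = sampleText) ' True

        ' With Singleline, "." also matches line breaks and the block is replaced.
        Dim cleaned As String = Regex.Replace(sampleText, pattern, Environment.NewLine, RegexOptions.Singleline)
        Console.WriteLine(cleaned)

        ' The inline flag (?s) is an equivalent way to switch the option on.
        Dim cleanedInline As String = Regex.Replace(sampleText, "(?s)" & pattern, Environment.NewLine)
        Console.WriteLine(cleaned = cleanedInline) ' True
    End Sub
End Module
```

Which form to prefer is mostly a matter of taste: [\s\S] works without any options, while Singleline keeps the pattern closer to what was first written.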

The result simply deletes the following:

(after subtracting
withholding allowances) is:
The amount of income tax
to withhold is:

(after subtracting
withholding allowances) is:
The amount of income tax
to withhold is:

Hope that’s more clear; I just wanted to post in case anyone else ran into similar problems with Whitespace and/or Hidden Characters while parsing text!

