Multiline Regex.Replace contents between two key words within text document

mr_steve · December 15, 2018, 4:15am

I banged my head on the wall for a few hours today trying to figure out why my Regex Pattern was working when using a lookahead and lookbehind on a set of keywords for selecting text in between the keywords on the same line, but it continued to fail within a multiple-line text document.

The example below is meant to select all text between " person—" and the words "Not over" in my document (these keywords appear on different lines).

Regex.Replace(readPDFText, "(?<=" & " person—" & ")(.*?)(?=" & "Not over" & ")", Environment.NewLine).ToString

I finally figured out that the period in (.*?) matches every character except the newline (\n) in the VB.NET flavor of regex. So if you find yourself facing the same problem,
changing:
(.*?)
to
([\s\S]*?)
did the trick.

Regex.Replace(readPDFText, "(?<=" & " person—" & ")([\s\S]*?)(?=" & "Not over" & ")", Environment.NewLine).ToString
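For anyone who wants to see the difference in isolation, here is a minimal sketch. The module name, the StripBetween helper, and the shortened sample string are all made up for illustration; only the two patterns come from the post above.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module DotVersusNewlineDemo
    ' Hypothetical helper: replaces whatever innerPattern matches between
    ' the two keywords with a single line break.
    Function StripBetween(text As String, innerPattern As String) As String
        Dim pattern As String = "(?<=person—)" & innerPattern & "(?=Not over)"
        Return Regex.Replace(text, pattern, Environment.NewLine)
    End Function

    Sub Main()
        ' Shortened stand-in for readPDFText: the keywords sit on different lines.
        Dim sample As String = "(b) MARRIED person—" & vbLf &
                               "(after subtracting" & vbLf &
                               "Not over Over"

        ' "." stops at the line feed, so the lookahead is never reached
        ' and the text comes back unchanged.
        Console.WriteLine(StripBetween(sample, "(.*?)") = sample)       ' True

        ' [\s\S] also matches line feeds, so the span between the keywords
        ' is replaced and the text changes.
        Console.WriteLine(StripBetween(sample, "([\s\S]*?)") = sample)  ' False
    End Sub
End Module
```

The lazy quantifier (*?) matters here too: it stops at the first "Not over" after each "person—", which is what you want when the block repeats through the document.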

Hope this helps someone!

indra · December 15, 2018, 4:18am

@mr_steve Can you please provide sample input and the expected output you are trying to achieve?

mr_steve · December 17, 2018, 1:25am

Hi @indra,

Actually my original post was not a question but a solution: I just posted it in case someone else ran into the same issue with Regex patterns that contain line breaks (in this case I was replacing text between a lookahead/lookbehind).

The code that works is:
Regex.Replace(readPDFText, "(?<=" & " person—" & ")([\s\S]*?)(?=" & "Not over" & ")", Environment.NewLine).ToString

(shown above with ampersands for clarity's sake, but this can be reduced to:)

Regex.Replace(readPDFText, "(?<=person—)([\s\S]*?)(?=Not over)", Environment.NewLine)

In case you’re still interested, the original Text looked like this:

TABLE 1—WEEKLY Payroll Period
(a) SINGLE person (including head of household) (b) MARRIED person—

(after subtracting
withholding allowances) is:
The amount of income tax
to withhold is:

(after subtracting
withholding allowances) is:
The amount of income tax
to withhold is:
Not over Over But not over Bracket Base Percentage of excess over Over But not over Bracket Base Percentage of excess over
$71 $0.00 0.00% $0 $0 $222 $0.00 0.00%
$71 $254 $0.00 10% $71 $222 $588 $0.00 10% $222
$254 $815 $18.30 12% $254 $588 $1,711 $36.60 12% $588
$815 $1,658 $85.62 22% $815 $1,711 $3,395 $171.36 22% $1,711
$1,658 $3,100 $271.08 24% $1,658 $3,395 $6,280 $541.84 24% $3,395
$3,100 $3,917 $617.16 32% $3,100 $6,280 $7,914 $1,234.24 32% $6,280
$3,917 $9,687 $878.60 35% $3,917 $7,914 $11,761 $1,757.12 35% $7,914
$9,687 $2,898.10 37% $9,687 $11,761 $3,103.57 37% $11,761

It’s tab-delimited data showing US Federal Income Tax brackets for weekly paychecks. The goal was to remove the text between “MARRIED person—” and the line starting “Not over Over But not over”. The same text I wished to delete repeats multiple times throughout the entire document (readPDFText), always between these same keywords.

What was originally messing me up were the line breaks: UiPath’s Replace Activity wasn’t working because of them, which led me to use Regex.Replace.

To confuse matters further, I had assumed that the special character “period” (“.”) in (.*?) meant “all characters including line breaks”. In .NET regex (and in JavaScript too, unless the s flag is set) the period by default means “all characters except the newline (\n)” … so (.*?) (see original post) kept failing. (.\n*?) was not working either, since it only matches one non-newline character followed by optional newlines, so my line breaks were not getting found by these regex patterns.

Instead, the answer I found is that you can use ([\S\s]*?) which accounts for all Whitespace and all Non-Whitespace characters.
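As far as I know, .NET also lets you keep the plain period and switch its behavior with RegexOptions.Singleline (or the inline (?s) flag), which makes "." match \n as well. Note that RegexOptions.Multiline is a different option: it only changes how ^ and $ behave. Below is a rough sketch comparing the three forms; the module name, function names, and the short sample string are invented for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module SinglelineDemo
    ' Character-class form from the post.
    Function ViaClass(text As String) As String
        Return Regex.Replace(text, "(?<=person—)([\s\S]*?)(?=Not over)", Environment.NewLine)
    End Function

    ' Plain "." with RegexOptions.Singleline, so "." also matches \n.
    Function ViaOption(text As String) As String
        Return Regex.Replace(text, "(?<=person—)(.*?)(?=Not over)",
                             Environment.NewLine, RegexOptions.Singleline)
    End Function

    ' Same thing using the inline (?s) flag inside the pattern.
    Function ViaInline(text As String) As String
        Return Regex.Replace(text, "(?s)(?<=person—)(.*?)(?=Not over)", Environment.NewLine)
    End Function

    Sub Main()
        Dim sample As String = "person—" & vbLf & "to withhold is:" & vbLf & "Not over"

        ' All three forms should produce the same cleaned text.
        Dim same As Boolean = (ViaClass(sample) = ViaOption(sample)) AndAlso
                              (ViaOption(sample) = ViaInline(sample))
        Console.WriteLine(same)  ' True
    End Sub
End Module
```

Which one to use is mostly taste: [\s\S] keeps everything in the pattern string, which is handy in UiPath where passing a RegexOptions argument may be less convenient.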

The result simply deletes the following:

(after subtracting
withholding allowances) is:
The amount of income tax
to withhold is:

(after subtracting
withholding allowances) is:
The amount of income tax
to withhold is:

Hope that’s more clear; I just wanted to post in case anyone else ran into similar problems with Whitespace and/or Hidden Characters while parsing text!

system · December 20, 2018, 1:25am

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
