Regex first lines after \r\n\r\n

Input string:
“Vat-nr. 457674456\r\nDepartment 4500 Pepsi Co\r\nEmployeeno. 6000 Avenue 5\r\nDate 18-11-11 Newark\r\n\r\nAccountant\r\nJohn Doe Smith\r\nTest Road 2\r\n1234 Test Town\r\n\r\nblabla\r\nblabla\r\nblabla\r\nblabla\r\n\r\nblablabla”

Output:
Accountant
John Doe Smith

Data: The title and name can vary.

I can't get a positive lookbehind with \r\n\r\n to work. Why not? :slightly_smiling_face:

Line breaks in Windows usually consist of \r\n, but not all text files are formatted like that. Some files use just \n as the line break. So it’s better to make \r optional with the question mark quantifier: \r?\n.
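To illustrate the optional \r, here is a minimal Python sketch (not UiPath or .NET code). The two sample strings are made up for the illustration; they hold the same two lines, once with Windows and once with Unix line endings.

```python
import re

# Hypothetical samples: identical content, different line endings.
windows_text = "Accountant\r\nJohn Doe Smith"
unix_text = "Accountant\nJohn Doe Smith"

# \r?\n matches a line break with or without a preceding carriage return.
line_break = re.compile(r"\r?\n")

windows_lines = line_break.split(windows_text)
unix_lines = line_break.split(unix_text)

print(windows_lines)
print(unix_lines)
```

Both calls split the text into the same two lines, so a pattern built on \r?\n does not depend on where the file came from.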

Try this:

(?<=^Vat-nr\.(.|\r|\n\S)+(\r?\n){2})(.+\r?\n.+)

It uses “Vat-nr.” as an anchor and takes the first two lines after the paragraph that starts with “Vat-nr.”.
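For anyone who wants to try the same idea outside UiPath, here is a rough Python sketch. It is not the .NET pattern above: Python's re module only accepts fixed-width lookbehinds, so this version anchors on “Vat-nr.” and uses two capturing groups instead. The pattern and variable names are my own assumptions for the illustration.

```python
import re

input_text = (
    "Vat-nr. 457674456\r\nDepartment 4500 Pepsi Co\r\n"
    "Employeeno. 6000 Avenue 5\r\nDate 18-11-11 Newark\r\n\r\n"
    "Accountant\r\nJohn Doe Smith\r\nTest Road 2\r\n1234 Test Town\r\n\r\n"
    "blabla\r\nblabla\r\nblabla\r\nblabla\r\n\r\nblablabla"
)

pattern = re.compile(
    r"^Vat-nr\.[^\r\n]*"      # first line of the anchor paragraph
    r"(?:\r?\n[^\r\n]+)*"     # remaining non-empty lines of that paragraph
    r"(?:\r?\n){2}"           # the blank line that ends the paragraph
    r"([^\r\n]+)\r?\n"        # group 1: first line of the next paragraph
    r"([^\r\n]+)"             # group 2: second line of the next paragraph
)

match = pattern.search(input_text)
title, name = match.group(1), match.group(2)
print(title)
print(name)
```

Using [^\r\n]+ instead of .+ keeps a trailing \r out of the captured lines.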

Thank you, but my input does not look like what you showed. It looks like this:
“Vat-nr. 457674456\r\nDepartment 4500 Pepsi Co\r\nEmployeeno. 6000 Avenue 5\r\nDate 18-11-11 Newark\r\n\r\nAccountant\r\nJohn Doe Smith\r\nTest Road 2\r\n1234 Test Town\r\n\r\nblabla\r\nblabla\r\nblabla\r\nblabla\r\n\r\nblablabla”

I don't have visual line breaks, only the \r and \n.

@LauraMM

This should give you a start: (?<=\\r\\n\\r\\n|\\r\\n)([a-zA-Z\s]{1,})

Seems great, but how do I get only Accountant and John Doe Smith out, and not the other matches?

Use a For Each activity and an If activity.

The condition could be: Not item.ToLower.Equals("department") And Not item.ToLower.Equals("employee"), etc.

I would suggest that you format your input text so it has visual line breaks:

inputText = inputText.Replace("\r\n", Environment.NewLine)
employeeInfo = System.Text.RegularExpressions.Regex.Match(inputText, "(?<=^Vat-nr\.(.|\r|\n\S)+(\r?\n){2})(.+\r?\n.+)").Value

Not a useful solution, since either department or employee is known.

Help us understand: how do you get this data?

Is it extracted from a PDF using OCR?

Thanks for really helping. I think this is the right way to go, but it doesn't work. I read my input from a PDF (text), and when I do
inputText = inputText.Replace("\r\n", Environment.NewLine)
it still shows up with the \r and \n in debug mode.

Extracted from a PDF using Read PDF Text in UiPath :slightly_smiling_face:

Ignore the output in Debug mode. It displays the string as a C# expression, so seeing \r\n there is normal. Try printing it to the console using Write Line or showing it in a Message Box instead.
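The difference between how a debugger displays a string and what the string actually contains can be sketched in Python, where repr() plays the role of the escaped debug view. This is only an analogy, not UiPath code, and the sample text is made up.

```python
# Hypothetical sample containing a real carriage return and line feed.
text = "Accountant\r\nJohn Doe Smith"

# repr() shows an escaped view, similar to what a debugger displays.
escaped_view = repr(text)

# The string holds real CR/LF characters, not backslash sequences.
contains_real_crlf = "\r\n" in text
contains_literal_backslashes = "\\r\\n" in text

print(escaped_view)
print(contains_real_crlf, contains_literal_backslashes)
```

Assuming the workflow uses VB.NET expressions, this would also explain why the Replace call changes nothing: VB.NET string literals have no backslash escapes, so "\r\n" there means the four literal characters backslash, r, backslash, n, which a string with real line breaks does not contain.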

Hey ptrobot. Your solution works perfectly if I print the PDF text with Write Line, copy it into Notepad or regex101, and then copy it back into my inputText string variable.

But if I read directly from the PDF into the variable and then run the regex on it, it doesn't work.

That's strange?

Yes, that’s really strange. Do you have a sample PDF that doesn’t contain sensitive data you could upload here so we can take a look?
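Until a sample is available, one way to narrow this down might be to list every character in the extracted text that is not printable ASCII, since a copy through Notepad can silently change such characters. The thread does not establish that this is the cause. The Python sketch below only shows the idea; the helper name and the sample string (with a no-break space planted in it) are hypothetical.

```python
import unicodedata

def describe_unusual_chars(text):
    """Return (index, code point, name) for each character outside printable ASCII."""
    report = []
    for index, char in enumerate(text):
        if not (" " <= char <= "~"):
            # Control characters have no Unicode name, so fall back to a label.
            name = unicodedata.name(char, "<control>")
            report.append((index, f"U+{ord(char):04X}", name))
    return report

# Hypothetical sample: a no-break space hides between the date and the city.
sample = "Date 18-11-11\u00a0Newark\r\n\r\nAccountant"
report = describe_unusual_chars(sample)
for entry in report:
    print(entry)
```

If the extracted text contained line separators other than \r\n or \n, or unexpected spaces, they would show up in such a list.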
