Extract text from webpage and keep line breaks

Hello,

I want to extract some text from a web page and I want to keep this format (with line breaks).
image

The Get Text activity returns a string with all the words on a single line when I write it to Notepad. The same happens with the Get Full Text activity. I inspected the element, and it seems that the steps (1), 2), 3), 4)) are separated with line breaks.

I also cannot use OCR activities because, in my scenario, the text won't be visible to the robot (but the web page will be loaded, and the Get Text activities work).

Is there any other way to extract and preserve this format of the text?
Thank you

@Nestor_Gabriel_Lucian1

Did you try Get Visible Text?

Alternatively, instead of getting the inner text, try getting the inner HTML. Then we can see how the lines are separated and split on <br> if it is present.

cheers
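
To illustrate the inner HTML suggestion above, here is a minimal Python sketch (not a UiPath activity) that splits a string on <br> tags. The sample string is an assumption modeled on the steps described in this thread, and the function name br_to_lines is hypothetical.

```python
import html
import re


def br_to_lines(inner_html):
    """Split an innerHTML string on <br> tags and return the non-empty lines."""
    # Match <br>, <br/> and <br /> in any letter case.
    parts = re.split(r"<br\s*/?>", inner_html, flags=re.IGNORECASE)
    # Decode entities such as &amp; and trim surrounding whitespace.
    lines = [html.unescape(part).strip() for part in parts]
    # Drop empty pieces left by leading, trailing, or doubled <br> tags.
    return [line for line in lines if line]


# Assumed sample: the real innerHTML of the page may differ.
sample = "1. key on<br>2) go in NIGHT MODE<BR/>3) go in DAY MODE<br />4) Check dimming soft telltale."
print("\n".join(br_to_lines(sample)))
```

Because the split relies only on the <br> tags, it does not depend on the lines being numbered.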

We assume that the text is a numbered list.

So from:
image

With:
image

We got:
image

Text with line breaks.

What was modeled at your end? What does the HTML snippet look like?

I got this when I extracted the inner HTML:

  1. key on
    2) go in NIGHT MODE
    3) go in DAY MODE
    4) Check dimming soft telltale.

Here is a photo, since the actual text is not preserved when posting a comment.

I guess I can use a regex to remove the content inside the angle brackets and split the lines on what's left (“<>”).

Thank you. My text won’t always be numbered, so I can’t rely on a regex that looks for numbers followed by “)” or “.”.

I followed your example and got False on that condition. It seems that the modern Get Text does the same as the classic one. Maybe you had a different setup. I also tested it on Notepad and still got False.

Please share it with us.

image

@Nestor_Gabriel_Lucian1

You can try this regex, if that helps:

System.Text.RegularExpressions.Regex.Replace(String.Join(Environment.NewLine,System.Text.RegularExpressions.Regex.Split(str,"<[^>]+>")),"(\r?\n)+",Environment.NewLine)

image

cheers
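
For readers who want to see the steps of that expression spelled out, here is a rough Python sketch of the same idea: split on tags, join the pieces with line breaks, then collapse repeated line breaks. The function name and the sample string are hypothetical. It matches one tag at a time with <[^>]+>, because a greedy <.*> can swallow the text between two tags that sit on the same line.

```python
import re


def strip_tags_keep_breaks(inner_html):
    """Replace every HTML tag with a line break and collapse repeated breaks."""
    # Split on single tags; [^>]+ stops at the first closing bracket.
    pieces = re.split(r"<[^>]+>", inner_html)
    joined = "\n".join(pieces)
    # Collapse runs of line breaks into one, then trim the ends.
    return re.sub(r"(\r?\n)+", "\n", joined).strip()


# Assumed sample: the real innerHTML of the page may differ.
sample = "<div>1. key on<br>2) go in NIGHT MODE<br>3) go in DAY MODE</div>"
print(strip_tags_keep_breaks(sample))

# Greedy pitfall: on one line, <.*> matches from the first < to the last >.
print(re.split(r"<.*>", "a<br>b<br>c"))
```

When each tag sits on its own line, the greedy and non-greedy patterns behave the same, which may be why either can appear to work on a given page.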

Thank you! It worked perfectly.
