Extract text using UiPath RegEx Builder

Hi All,

From the following text:

\n<img alt=\"SAMPLE TEXT NUMBER 1" data-largest=\"https://cms-imgp.TEST-cdn.org/img/p/2024322/univ/art/2024322_univ_cnt_4_lg.jpg\" data-smallest=\"https://cms-imgp.TEST-cdn.org/img/p/2024322/univ/art/2024322_univ_cnt_4_xs.jpg\" src=\"https://cms-imgp.TEST-cdn.org/img/p/2024322/univ/art/2024*"

how can I write the following RegEx expression using the UiPath RegEx Builder:

\b(?<=alt"“)[\s\S]*?(?=”")


Hi,

It seems the = after alt is missing.

Can you try either of the following?


\b(?<=alt=")[\s\S]*?(?=")

OR


\b(?<=alt=\u0022)[\s\S]*?(?=\u0022)

Regards,
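
As a rough cross-check outside UiPath, the two suggested patterns can be compared side by side. This is only a sketch in Python's re module, on the assumption that it treats these patterns the same way as .NET regex for this input. The sample string is a shortened, hypothetical version of the HTML above (the example.org URLs are placeholders), written with plain double quotes.

```python
import re

# Hypothetical, shortened sample using plain double quotes (not the escaped \" form).
html = (
    '<img alt="SAMPLE TEXT NUMBER 1" '
    'data-largest="https://example.org/a_lg.jpg" '
    'src="https://example.org/a.jpg">'
)

# Variant 1: the double quote is written literally inside the lookarounds.
pattern_literal = r'\b(?<=alt=")[\s\S]*?(?=")'
# Variant 2: the double quote is written as the escape \u0022.
pattern_escaped = r'\b(?<=alt=\u0022)[\s\S]*?(?=\u0022)'

match_literal = re.search(pattern_literal, html)
match_escaped = re.search(pattern_escaped, html)

print(match_literal.group(0))
print(match_escaped.group(0))
```

Both variants should return the same alt text. The \u0022 form is mainly convenient where typing a literal double quote inside a string expression is awkward.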

Let's link to the preparatory discussion

When porting it to the Matches activity:

The test string has to use a single " because it represents the actual HTML code, not the escaped form shown in the Immediate panel.

The pattern also has to use a single " because it does not need the escaping that a .NET String assignment requires.

None of the solutions worked yet :frowning:

I'm getting the value directly from the webpage using "

Hi,

Can you share your string as a text file using the WriteTextFile activity?
The exact string is necessary to extract it.

Regards,

As expected, we can execute it and it extracts the value.

We hope that all adaptations/alignments were done correctly at your end, as elaborated by us.

Still, we see that the test string is in the wrong format and is not what is returned from, e.g., an outerhtml attribute.

Thanks for the guide. I found that the original text from the web was trimmed when I created the Dictionary.

Please find attached the file created using WriteTextFile activity

txtOuter.txt (732 Bytes)


Hi,

In my environment, the following works.

System.Text.RegularExpressions.Regex.Match(strData,"\b(?<=alt=\u0022)[\s\S]*?(?=\u0022)").Value

Regards,



also have a look here:

and keep in mind:

In some debugging panels a string is displayed with \" or "",
but that escaping is not part of the content / variable value.

Within Assign activities for Strings we do need to escape an inner " with a second one
(demonstrated with ""Super"").
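
The difference between what a panel displays and what the variable holds can be sketched as follows. This uses Python rather than VB.NET, so the escaping style differs (backslashes instead of doubled quotes), and json.dumps only stands in for a debugger-style rendering. Both are assumptions for illustration, not what UiPath does internally.

```python
import json
import re

# Actual content of the variable: plain double quotes, no backslashes.
html = '<img alt="Super">'

# An escaped rendering of the same value, as a debug panel might display it.
displayed = json.dumps(html)

pattern = r'(?<=alt=")[\s\S]*?(?=")'

# The pattern is written for the real content ...
on_value = re.search(pattern, html)
# ... so it finds nothing when the escaped rendering is pasted in as the test string.
on_displayed = re.search(pattern, displayed)

print(html)
print(displayed)
print(on_value.group(0))
print(on_displayed)
```

This is why a test string copied from a debugging panel can fail in the RegEx Builder even though the same pattern works on the real variable value.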

Hi @M_G_C

System.Text.RegularExpressions.Regex.Match(str_Text,"((?<=alt..)[A-Z]+\s*[A-Z0-9.]+(?=.))").Value

Regards


Thank you so much @ppr @Yoichi and @vrdabberu

Using the regular expressions that you provided helped me solve the situation. On top of that, the other topic about nested elements that I worked on with @ppr was a key factor in understanding it.

@Yoichi, during the debugging process it was essential to use the WriteTextFile activity.

Based on your RegEx expression, I researched each component as follows:

"\b(?<=alt=\u0022)[\s\S]*?(?=\u0022)"
This regular expression is used to match a specific pattern in a string. Let's break down each component of the expression:

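  1. \b: This denotes a word boundary. It matches a position between a word character and a non-word character (or the start or end of the string). Because the lookbehind that follows requires a double quote, which is a non-word character, immediately before this position, the extracted text has to start with a word character.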
  2. (?<=alt=\u0022): This is a positive lookbehind assertion ((?<=...)) that checks if the string is preceded by the characters alt= followed by \u0022. \u0022 is the Unicode representation for a double quote character ("). So, this part of the expression ensures that the pattern is preceded by alt=".
  3. [\s\S]*?: This matches any character (\s for whitespace, \S for non-whitespace) zero or more times (*? for a non-greedy match). Essentially, it captures any character (including line breaks and whitespace) between the alt=" and the closing double quote, but in a non-greedy manner, meaning it will stop at the first occurrence of the next part of the pattern.
  4. (?=\u0022): This is a positive lookahead assertion ((?=...)) that checks if the string is followed by \u0022, representing a double quote ("). This ensures that the pattern ends before the closing double quote.

Combining all these elements, the regular expression \b(?<=alt=\u0022)[\s\S]*?(?=\u0022) can be read as:

Match any sequence of characters that:

  • Begins at a word boundary following alt=".
  • Includes any characters (including line breaks and whitespace) reluctantly until it encounters the closing double quote ".
  • Ends just before the closing double quote.

This expression is especially useful for extracting the content within the alt attribute of an HTML tag where \u0022 represents the double quote character.
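
The effect of each component can be checked with a small experiment. This is a sketch in Python's re module rather than .NET, on the assumption that both engines behave the same for these patterns. The sample strings and the example.org URL are hypothetical.

```python
import re

# Hypothetical sample with two quoted attributes.
html = '<img alt="SAMPLE TEXT NUMBER 1" src="https://example.org/a.jpg">'

# Non-greedy quantifier: stops at the first closing quote.
lazy = re.search(r'\b(?<=alt=\u0022)[\s\S]*?(?=\u0022)', html).group(0)

# Greedy quantifier: runs on to the last double quote in the string.
greedy = re.search(r'\b(?<=alt=\u0022)[\s\S]*(?=\u0022)', html).group(0)

# Without lookarounds, alt=" and the closing quote become part of the match.
no_lookaround = re.search(r'alt="[\s\S]*?"', html).group(0)

# \b requires the alt text to start with a word character,
# so a value starting with "(" is not matched at all.
boundary_miss = re.search(
    r'\b(?<=alt=\u0022)[\s\S]*?(?=\u0022)', '<img alt="(draft) logo">'
)

print(lazy)
print(greedy)
print(no_lookaround)
print(boundary_miss)
```

If alt texts that start with punctuation or are empty need to be extracted as well, dropping the leading \b might be worth testing. That variation is not something the thread itself tried.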

Perfect, so you can close all related topics by

Thank you for your help

