Extract a single value from multiple occurrences matched by regex

In the Document Understanding “Regex Based Extractor”,
I need to extract a name. I’ve used the regex “[A-Z][a-zA-Z.]+\s[A-Z][a-zA-Z.]+”, but it matches multiple occurrences, and I need to extract only the first two occurrences.

Is there a way to limit the regex to a specific portion of text?

Hi @Salleh ,

Could you provide us with the text data that is being used, so that we can test on our end and offer suggestions?

Hi @supermanPunch,

=-p-sov=502JU56U_gE 00 AL MASAM STATIONERY L.L.C

OFEICE, SCHOOL & ART SUPPLIES SOLUTIONS

BF

(ESTABLISHED SINCE 1991) TRN : 100205501800003

Invoice No. 80470 Ref.
No. JFZ-PO-22-100983

@Salleh ,

It seems that your data was extracted by OCR. However, we need to know whether the data follows a set pattern. For example, is the word OFEICE/OFFICE, SCHOOL always present in the second line?

To target the required values, we would need more information on the pattern of the data. This can be found by analysing multiple text inputs that you receive.

For now, could you try the expression below (for extracting only the first occurrence of the words):

Regex :

(\s[A-Z][a-zA-Z. ]+)+

Although the regex matches multiple occurrences, we split the text on new lines, keep only the first line, and extract the textual data (continuous alphabetical text) from it.

Expression :

Regex.Match(Split(wordText,Environment.NewLine)(0),"(\s[A-Z][a-zA-Z. ]+)+").Value.Trim
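
For readers who want to try the same idea outside UiPath, here is a minimal Python sketch of the expression above: split the text on new lines, keep the first line, run the regex, and trim the result. The names word_text and first_line_name are assumptions made for illustration, and Python's re module stands in for the .NET Regex class, so behaviour may differ on edge cases.

```python
import re

# Same pattern as in the expression above.
PATTERN = re.compile(r"(\s[A-Z][a-zA-Z. ]+)+")


def first_line_name(word_text: str) -> str:
    """Return the trimmed first match on the first line, or '' if there is none."""
    lines = word_text.splitlines()
    first_line = lines[0] if lines else ""
    match = PATTERN.search(first_line)
    return match.group(0).strip() if match else ""


# First two lines of the sample text posted in this thread.
sample = (
    "=-p-sov=502JU56U_gE 00 AL MASAM STATIONERY L.L.C\n"
    "OFEICE, SCHOOL & ART SUPPLIES SOLUTIONS"
)
print(first_line_name(sample))  # AL MASAM STATIONERY L.L.C
```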

@Salleh

From the extracted data, you can access only the first two matches by using matches.Take(2); alternatively, matches(0) and matches(1) will give you the first two.

Here, matches is the output of the regex activity.
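
As a rough illustration of taking only the first two matches, the Python sketch below does the equivalent of matches.Take(2) with the regex from the original question. The function name first_two_matches is hypothetical, and the sample is the first two lines of the text posted earlier in this thread.

```python
import re
from itertools import islice

# The pattern from the original question.
NAME_PATTERN = re.compile(r"[A-Z][a-zA-Z.]+\s[A-Z][a-zA-Z.]+")


def first_two_matches(text: str) -> list:
    """Return at most the first two matches, similar to matches.Take(2)."""
    return [m.group(0) for m in islice(NAME_PATTERN.finditer(text), 2)]


sample = (
    "=-p-sov=502JU56U_gE 00 AL MASAM STATIONERY L.L.C\n"
    "OFEICE, SCHOOL & ART SUPPLIES SOLUTIONS"
)
print(first_two_matches(sample))  # ['AL MASAM', 'STATIONERY L.L.C']
```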

Hope this helps

Cheers

Thank you @supermanPunch & @Anil_G

The Regex Based Extractor in Document Understanding does not support string expressions; only a regex can be used to meet the required condition.

Regards,

@Salleh

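Ahh, I missed the Document Understanding part.

Okay, so do you know whether the required two strings will always come in the first line?

If yes, then you can try this: add (?<!\n.*) before your regex so that it matches only in the first row.

This is how it works: the negative lookbehind (?<!\n.*) fails at any position that has a newline somewhere before it, so a match can only start on the first line.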

Hope this helps

Cheers

You can use the following regular expression to match the pattern you described:

(?:\b[A-Za-z.]+\b\s){3}\b[A-Za-z.]+\b

Explanation:

  • (?: and ) - non-capturing group; it groups the pattern but does not capture the match.
  • \b - word boundary; it asserts that we are at the start or the end of a word.
  • [A-Za-z.]+ - matches one or more letters and/or dots.
  • \s - matches a whitespace character.
  • {3} - matches the preceding group exactly 3 times.
  • \b[A-Za-z.]+\b - the fourth word, matched without trailing whitespace.

When only the first match is taken, this regular expression returns the first occurrence of four consecutive words made up of letters and the . character in the given text.
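
The pattern can be checked quickly with a small Python sketch. The helper name first_four_words is made up for illustration, and the sample is the first two lines of the text posted earlier in this thread; Python's re module is used here rather than the .NET engine behind the extractor.

```python
import re

# Four consecutive "words" made of letters and dots, separated by whitespace.
FOUR_WORDS = re.compile(r"(?:\b[A-Za-z.]+\b\s){3}\b[A-Za-z.]+\b")


def first_four_words(text: str):
    """Return the first run of four consecutive words, or None if there is none."""
    match = FOUR_WORDS.search(text)
    return match.group(0) if match else None


sample = (
    "=-p-sov=502JU56U_gE 00 AL MASAM STATIONERY L.L.C\n"
    "OFEICE, SCHOOL & ART SUPPLIES SOLUTIONS"
)
print(first_four_words(sample))  # AL MASAM STATIONERY L.L.C
```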

@Salleh ,

Apologies. Could you try the regex below in the Regex Based Extractor? Do enable the Multiline regex option.

\A.*?(\s[A-Z][a-zA-Z. ]+)+$\n
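
A minimal Python sketch of how this pattern behaves, assuming that the value of interest is the captured group (whether the extractor returns the group or the whole match is not stated in this thread). The helper name first_line_capture is hypothetical, and re.MULTILINE plays the role of the Multiline option.

```python
import re

# \A anchors at the start of the text; with MULTILINE, $ matches before a newline.
FIRST_LINE = re.compile(r"\A.*?(\s[A-Z][a-zA-Z. ]+)+$\n", re.MULTILINE)


def first_line_capture(text: str):
    """Return the trimmed captured group, or None if the pattern does not match."""
    match = FIRST_LINE.search(text)
    return match.group(1).strip() if match else None


sample = (
    "=-p-sov=502JU56U_gE 00 AL MASAM STATIONERY L.L.C\n"
    "OFEICE, SCHOOL & ART SUPPLIES SOLUTIONS"
)
print(first_line_capture(sample))  # AL MASAM STATIONERY L.L.C
```

One thing to watch for: \s also matches a newline, so when the first line contains no capitalised word the capture can start on the second line instead.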

Do test it with multiple samples of the input data that you receive as well.

Let us know if this does not work.

Thank you @Anil_G & @Mudassar_Majeed,
your answers were really helpful.

Thank you @supermanPunch,
your solution solved my requirement. Stay blessed.
Regards,
