Extract a single value from multiple occurrences matched by regex

Salleh · January 26, 2023, 10:53am

In the Document Understanding “Regex Based Extractor”,
I need to extract a name. I’ve used the regex “[A-Z][a-zA-Z.]+\s[A-Z][a-zA-Z.]+”, but it matches multiple occurrences, and I need to extract only the first two occurrences.

Is there a way to limit the regex to a specific portion of text?

supermanPunch · January 26, 2023, 11:15am

Hi @Salleh ,

Could you provide us with the text data that is being used, so that we can test on our end and give you suggestions?

Salleh · January 26, 2023, 11:19am

Hi @supermanPunch,

=-p-sov=502JU56U_gE 00 AL MASAM STATIONERY L.L.C

OFEICE, SCHOOL & ART SUPPLIES SOLUTIONS

BF

(ESTABLISHED SINCE 1991) TRN : 100205501800003

Invoice No. 80470 Ref.
No. JFZ-PO-22-100983

supermanPunch · January 26, 2023, 12:52pm

@Salleh ,

It seems that your data was extracted by OCR. However, we need to know whether the data follows a set pattern. For example, will the words OFEICE/OFFICE, SCHOOL always be present in the second line?

To target the required values, we need more information on the pattern of the data. This can be found by analysing multiple text inputs of the kind you would receive.

For now, could you check the expression below (for extracting only the first occurrence of the words):

Regex :

(\s[A-Z][a-zA-Z. ]+)+

Although the regex matches multiple occurrences, we split the text on new lines, take only the first line, and extract the continuous alphabetical text from it.

Expression :

Regex.Match(Split(wordText,Environment.NewLine)(0),"(\s[A-Z][a-zA-Z. ]+)+").Value.Trim
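For anyone who wants to try the same idea outside Studio, here is a minimal Python sketch of the split-then-match approach. It is only an illustration: the variable names are made up, Python's `re` is not the .NET engine, and `splitlines()` stands in for the split on Environment.NewLine.

```python
import re

# First lines of the OCR text posted earlier in the thread.
word_text = (
    "=-p-sov=502JU56U_gE 00 AL MASAM STATIONERY L.L.C\n"
    "\n"
    "OFEICE, SCHOOL & ART SUPPLIES SOLUTIONS\n"
)

# Keep only the first line, then search it for a run of capitalised words.
first_line = word_text.splitlines()[0]
match = re.search(r"(\s[A-Z][a-zA-Z. ]+)+", first_line)

# Trim the leading whitespace that the pattern captures.
company_name = match.group(0).strip() if match else ""

print(company_name)  # AL MASAM STATIONERY L.L.C
```

The match starts at the first whitespace that is followed by a capital letter, which is why the garbled prefix and the "00" are skipped.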

Anil_G · January 26, 2023, 1:22pm

@Salleh

From the extracted matches, you can take only the first two by using matches.Take(2), or matches(0) and matches(1) will give you the first two…

Here, matches is the output of the regex activity
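As a rough illustration of "take the first two matches", here is a small Python sketch using the regex from the question. It assumes the text is available as a plain string (which is not the case inside the Regex Based Extractor), and `ocr_text`, `name_pattern` and `first_two` are names made up for the example.

```python
import re
from itertools import islice

# First lines of the OCR text posted earlier in the thread.
ocr_text = (
    "=-p-sov=502JU56U_gE 00 AL MASAM STATIONERY L.L.C\n"
    "\n"
    "OFEICE, SCHOOL & ART SUPPLIES SOLUTIONS\n"
)

# Regex from the original question: two capitalised words separated by whitespace.
name_pattern = re.compile(r"[A-Z][a-zA-Z.]+\s[A-Z][a-zA-Z.]+")

# finditer yields matches lazily; islice stops after the first two.
first_two = [m.group(0) for m in islice(name_pattern.finditer(ocr_text), 2)]

print(first_two)  # ['AL MASAM', 'STATIONERY L.L.C']
```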

Hope this helps

Cheers

Salleh · January 26, 2023, 5:04pm

Thank you @supermanPunch & @Anil_G.

The Regex Based Extractor in Document Understanding does not support string expressions; only a regex can be used to express the required condition.

Regards,

Anil_G · January 26, 2023, 5:41pm

@Salleh

Ahhh… I missed the Document Understanding part…

Okay, so do you know if the two required strings will always come in the first line?

If yes, then you can add this before your regex so that it matches only in the first row: (?<!\n.*)

It is a negative lookbehind: a match is rejected if there is a newline anywhere before it, so only matches on the first line remain.
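Python's built-in `re` module does not accept a variable-length lookbehind such as (?<!\n.*), so the sketch below only emulates the same condition: it keeps a match when no newline occurs before its start. The helper name `first_line_matches` and the variable names are made up for the illustration.

```python
import re

# First lines of the OCR text posted earlier in the thread.
ocr_text = (
    "=-p-sov=502JU56U_gE 00 AL MASAM STATIONERY L.L.C\n"
    "\n"
    "OFEICE, SCHOOL & ART SUPPLIES SOLUTIONS\n"
)

# Regex from the original question.
name_pattern = re.compile(r"[A-Z][a-zA-Z.]+\s[A-Z][a-zA-Z.]+")


def first_line_matches(pattern, text):
    """Return the matches that have no newline anywhere before their start."""
    return [
        m.group(0)
        for m in pattern.finditer(text)
        if "\n" not in text[: m.start()]
    ]


all_matches = [m.group(0) for m in name_pattern.finditer(ocr_text)]
kept = first_line_matches(name_pattern, ocr_text)

print(all_matches)  # ['AL MASAM', 'STATIONERY L.L.C', 'ART SUPPLIES']
print(kept)         # ['AL MASAM', 'STATIONERY L.L.C']
```

In .NET the lookbehind does this filtering inside the regex itself, which is what makes it usable in an extractor that only accepts a pattern.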

Hope this helps

Cheers

Mudassar_Majeed · January 26, 2023, 9:56pm

You can use the following regular expression to match the pattern you described:

(?:\b[A-Za-z.]+\b\s){3}\b[A-Za-z.]+\b

Explanation:

(?: and ) - non-capturing group, it will group the pattern but will not capture the match.
\b - word boundary, it asserts that we are at the start of a word or the end of a word.
[A-Za-z.]+ - matches one or more letters and/or dots
\s - matches a whitespace character
{3} - matches the preceding element exactly 3 times
\b - word boundary, it asserts that we are at the start of a word or the end of a word.

This regular expression will match the first occurrence of four consecutive words made up of letters and the . character in the given text. Note that it assumes the name is always exactly four words long.
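A quick way to check the four-word pattern is a short Python sketch like the one below. It runs against the first line of the sample text from this thread; the three-word string is a hypothetical input added only to show the limitation.

```python
import re

# First line of the OCR text posted earlier in the thread.
first_line = "=-p-sov=502JU56U_gE 00 AL MASAM STATIONERY L.L.C"

# Three "word + whitespace" groups followed by a fourth word.
four_words = re.compile(r"(?:\b[A-Za-z.]+\b\s){3}\b[A-Za-z.]+\b")

match = four_words.search(first_line)
name = match.group(0) if match else None

# Hypothetical three-word name: the pattern needs four words, so nothing matches.
short_name = four_words.search("ACME TRADING LLC")

print(name)        # AL MASAM STATIONERY L.L.C
print(short_name)  # None
```

The word boundaries are what keep the garbled prefix out: it has no run of letters that is followed by whitespace and starts on a boundary.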

supermanPunch · January 27, 2023, 5:17am

@Salleh ,

Apologies. Could you check the regex below in the Regex Based Extractor? Do enable the Multiline regex option.

\A.*?(\s[A-Z][a-zA-Z. ]+)+$\n
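The pattern can be tried outside Studio with a Python sketch like the following, where `re.MULTILINE` stands in for the Multiline option. The thread does not say whether the extractor returns the whole match or the captured group, so the sketch prints both; the variable names are made up.

```python
import re

# First lines of the OCR text posted earlier in the thread.
ocr_text = (
    "=-p-sov=502JU56U_gE 00 AL MASAM STATIONERY L.L.C\n"
    "\n"
    "OFEICE, SCHOOL & ART SUPPLIES SOLUTIONS\n"
)

# \A pins the match to the start of the text; with MULTILINE, $ matches at
# the end of the first line, so the match cannot start on a later line.
pattern = re.compile(r"\A.*?(\s[A-Z][a-zA-Z. ]+)+$\n", re.MULTILINE)

match = pattern.search(ocr_text)
whole_match = match.group(0)       # the entire first line plus its newline
captured = match.group(1).strip()  # the run of capitalised words

print(repr(whole_match))
print(captured)  # AL MASAM STATIONERY L.L.C
```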

Do test it with multiple input samples of the data that you receive as well.

Let us know if this does not work.

Salleh · January 27, 2023, 4:37pm

Thank you @Anil_G & @Mudassar_Majeed,
your answers were really helpful.

Thank you @supermanPunch,
your solution solved my requirement. Stay blessed.
Regards,

system · January 30, 2023, 4:37pm

This topic was automatically closed 3 days after the last reply. New replies are no longer allowed.
