Regex to match first two words of multiple lines

Hi, I am reading text from a PDF and using regex to find the first two words of each line in a block that sits between two fixed lines. Here is the text:

ID Name Quantity Value Description Weight (kg)
ADD 160 ADD - Supp- Power Rack Box (RED) 500.00
TDD 260 TDD - Supp - Leaf Green (GREEN) 200.00
ZTP 360 ZTP - Supp - Fire Red (RED) 100.00

Carrier: GST Limited

I want to have output as:
ADD 160
TDD 260
ZTP 360

The block can contain one line or many, so whenever a line is added or removed the output should change accordingly.
I tried (?<=Weight\s\(kg\)\n)(\w+\s+\w+)(?=[\s\S]*?Carrier:\s) but it returns only one match, ADD 160.

Maybe the following pattern strategy will fit:
image

Hi @RamboRocky

Use the Find Matching Patterns activity with the regular expression below:

^[A-Z\s]+[0-9]+

image

Open the Properties of the Find Matching Patterns activity, find the Pattern options dropdown, and check the Multiline option as well, so that ^ matches at the start of every line rather than only at the start of the text.
image
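
One possible reason a pattern like this matches in an online tester but not in Studio is the Multiline option. The sketch below reproduces the effect in Python rather than in Studio (an assumption made for illustration; the variable names are made up, and the sample is the text from the question). The pattern itself is unchanged.

```python
import re

# Sample text from the question (names here are illustrative only).
sample_text = """ID Name Quantity Value Description Weight (kg)
ADD 160 ADD - Supp- Power Rack Box (RED) 500.00
TDD 260 TDD - Supp - Leaf Green (GREEN) 200.00
ZTP 360 ZTP - Supp - Fire Red (RED) 100.00

Carrier: GST Limited
"""

pattern = r"^[A-Z\s]+[0-9]+"

# Without the multiline flag, ^ matches only at the start of the whole text.
# The header line has no digits after its leading capitals, so nothing is found.
without_multiline = re.findall(pattern, sample_text)

# With the multiline flag, ^ also matches at the start of every line.
with_multiline = re.findall(pattern, sample_text, flags=re.MULTILINE)

print(without_multiline)
print(with_multiline)
```

If the Studio result is empty while the tester shows three matches, the Multiline checkbox is probably the first thing to verify.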

Hope it helps!!

Hi mate, thanks. It works on the regex builder website, but in Matching Patterns in Studio it doesn’t work. Can you try it in Studio and see if it works?

Same here: it works in the regex builder but doesn’t match as a regex pattern in Studio. Can you try in Studio to see if it picks it up, please?

Hey @RamboRocky

Use the regex below:

\b[A-Z]+\s+\d+\b

In an Assign activity:

arr_Output = System.Text.RegularExpressions.Regex.Matches(str_Input,"\b[A-Z]+\s+\d+\b").Select(Function(x) x.Value).ToArray

The expression above returns the output as an Array of String.
Note: the data type of arr_Output must be Array of String!

Screenshots of the output and the code were attached for reference.

You will get the output!

Regards,
Ajay Mishra

It was working for me @RamboRocky

Are you using the Find Matching Patterns activity? If yes, follow the steps below:
→ Drag and drop the Find Matching Patterns activity.
→ Click the Configure Regular Expression option and select the Advanced option in the dropdown.
→ Enter the regular expression below in the Value field:

^[A-Z\s]+[0-9]+

→ Open the Properties of the Find Matching Patterns activity, find the Pattern options dropdown, and check the Multiline option as well.
image
→ Create a variable for the Result option in the properties.
→ Use a For Each activity to iterate over the output variable and get each value.

See the output panel image below for a better understanding:
image

Hope it helps!!

Hi all, thanks for your replies. All of the expressions work :)

Hi, I have one question. Is there any way to create a DataTable from the output, so that the values are stored under the headers ID and Name, for example ADD under ID and 160 under Name?

Regards

@RamboRocky
Looks like it is working:
image
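
One way to split each match into the two header columns could look like the sketch below. It is written in Python for illustration only; in Studio the rows would presumably go into a DataTable with ID and Name columns instead. The names `matches` and `rows` are hypothetical, and the matches are hard-coded rather than taken from a real workflow.

```python
# Matches as returned by one of the patterns in this thread (hard-coded here).
matches = ["ADD 160", "TDD 260", "ZTP 360"]

# Split each match on whitespace: the first word goes under "ID",
# the second word under "Name".
rows = []
for match in matches:
    item_id, name = match.split()
    rows.append({"ID": item_id, "Name": name})

for row in rows:
    print(row["ID"], row["Name"])
```

This assumes every match contains exactly two whitespace-separated words, which holds for the first sample but would need checking for other inputs.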

Hi, what if the text is like this?

ID Name Quantity Value Description Weight (kg)
ADD 160 ADD - Supp- Power Rack Box (RED) 500.00
G5 80 TDD - Supp - Leaf Green (GREEN) 200.00
Bulb 1 ZTP - Supp - Fire Red (RED) 100.00

Carrier: GST Limited

image

We recommend analysing the expected variations of the samples further, or trying a very generic delimiter-focused approach like:

image

@RamboRocky
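Please try this pattern: "(?:^|\b)[A-Z0-9]+\s+\d+\b". It should work in both cases, provided matching is case-insensitive (as written, [A-Z0-9]+ does not cover the lowercase letters in Bulb).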

Cheers,
Mounika

Hey @RamboRocky

You can use this: \b[A-Za-z0-9]+\s+\d+\b in your regex builder!

Input:
“ID Name Quantity Value Description Weight (kg)
ADD 160 ADD - Supp- Power Rack Box (RED) 500.00
G5 80 TDD - Supp - Leaf Green (GREEN) 200.00
Bulb 1 ZTP - Supp - Fire Red (RED) 100.00
ID Name Quantity Value Description Weight (kg)
ADD 160 ADD - Supp- Power Rack Box (RED) 500.00
TDD 260 TDD - Supp - Leaf Green (GREEN) 200.00
ZTP 360 ZTP - Supp - Fire Red (RED) 100.00”

Or in an Assign activity:

arr_Output = System.Text.RegularExpressions.Regex.Matches(str_Input,"\b[A-Za-z0-9]+\s+\d+\b").Select(Function(x) x.Value).ToArray

Output:
image
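
As a quick check of the two patterns suggested for the second sample, here is a hedged sketch in Python (not Studio; the variable names are illustrative). Under these assumptions the uppercase-only pattern skips Bulb 1 unless case-insensitive matching is switched on, while the mixed-case pattern picks it up.

```python
import re

# Second sample from the thread.
sample_text = """ID Name Quantity Value Description Weight (kg)
ADD 160 ADD - Supp- Power Rack Box (RED) 500.00
G5 80 TDD - Supp - Leaf Green (GREEN) 200.00
Bulb 1 ZTP - Supp - Fire Red (RED) 100.00

Carrier: GST Limited
"""

upper_only = r"(?:^|\b)[A-Z0-9]+\s+\d+\b"
mixed_case = r"\b[A-Za-z0-9]+\s+\d+\b"

# Uppercase-only character class, case-sensitive matching.
upper_only_matches = re.findall(upper_only, sample_text, flags=re.MULTILINE)

# Same pattern with case-insensitive matching enabled.
upper_only_ignore_case = re.findall(
    upper_only, sample_text, flags=re.MULTILINE | re.IGNORECASE
)

# Mixed-case character class, no extra flags needed.
mixed_case_matches = re.findall(mixed_case, sample_text)

print(upper_only_matches)
print(upper_only_ignore_case)
print(mixed_case_matches)
```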

Regards,
Ajay Mishra

Hi Peter,

So sorry, I should have analysed more variations before posting. I finally have all the variations now, and there shouldn’t be any others. I hope you don’t mind having a final look at the text below.

ID Name Quantity Value Description Weight (kg)
ADD 160 ADD - Supp- Power Rack Box (RED) 500.00
G5 80 TDD - Supp - Leaf Green (GREEN) 200.00
Bulb 1 ZTP - Supp - Fire Red (RED) 100.00
Bundle KB (25 pcs) 1 ZTP - Supp - Fire Red (RED) 100.00
Bundle PKCTL (18 pcs) 2 ZTP - Supp - Fire Red (RED) 100.00
BD 1 ZTP - Supp - Fire Red (RED) 100.00

Carrier: GST Limited

Is there any way to get the first two words, except that when Bundle is present we get just KB 1 and PKCTL 2?
So sorry for the inconvenience caused.

Kind Regards

Hello @RamboRocky,

Can you please try this pattern: "^\S+\s+\S+(?:\s+\(.+?\))?\s+\d+\b|(?:^|\b)[A-Z0-9]+\s+\d+\b"

Cheers,
Mounika

Maybe it is better to first extract the data block by:


and then work on the extracted data block with
image
to extract a deliberately general pattern,
which we can later post-process and cleanse by

to get a consistent pattern of
image
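
The three stages described above (extract the block, match generously, then cleanse) can be sketched as follows. This is a Python illustration under stated assumptions: the patterns are my own guesses, not the ones from the screenshots, and the variable names are hypothetical.

```python
import re

# Third sample from the thread.
sample_text = """ID Name Quantity Value Description Weight (kg)
ADD 160 ADD - Supp- Power Rack Box (RED) 500.00
G5 80 TDD - Supp - Leaf Green (GREEN) 200.00
Bulb 1 ZTP - Supp - Fire Red (RED) 100.00
Bundle KB (25 pcs) 1 ZTP - Supp - Fire Red (RED) 100.00
Bundle PKCTL (18 pcs) 2 ZTP - Supp - Fire Red (RED) 100.00
BD 1 ZTP - Supp - Fire Red (RED) 100.00

Carrier: GST Limited
"""

# Stage 1: extract the data block between the header line and "Carrier:".
block_match = re.search(
    r"Weight \(kg\)\n(.*?)\n\s*Carrier:", sample_text, flags=re.DOTALL
)
data_block = block_match.group(1) if block_match else ""

# Stage 2: deliberately general pattern. Take everything from the start of a
# line up to the first number that stands between two whitespace characters.
general = re.findall(r"^.*?\s\d+(?=\s)", data_block, flags=re.MULTILINE)

# Stage 3: cleanse each match by dropping a leading "Bundle" and any
# parenthesised group such as "(25 pcs)".
cleaned = [re.sub(r"^Bundle\s+|\s*\([^)]*\)", "", item) for item in general]

print(general)
print(cleaned)
```

In Studio the cleansing stage would presumably map to System.Text.RegularExpressions.Regex.Replace with an empty replacement string, applied to each match inside a loop.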

Regex to match first two words of multiple lines.zip (3.2 KB)
Hi @RamboRocky,

Please check the workflow

Thanks for doing this, mate. Almost there. I didn’t get the substitution part: how do we do that in Studio? Could you check my workflow, please? I am just a bit confused about how to remove the Bundle bit.

This is text file
output.txt (366 Bytes)

This is xaml file
Main.xaml (13.5 KB)

Hi, this gives me the following output:
image

What I want is:
ADD 160
G5 80
Bulb 1
KB 1
PKCTL 2
BD 1

I want to remove Bundle and the parts inside the brackets. Thanks.
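
One possible way to get exactly this output without a separate substitution step is to use capture groups: make the Bundle prefix and the bracketed part optional and non-capturing, and keep only the two captured words. The sketch below is in Python for illustration; the pattern and names are assumptions, not taken from any workflow in this thread.

```python
import re

# Third sample from the thread.
sample_text = """ID Name Quantity Value Description Weight (kg)
ADD 160 ADD - Supp- Power Rack Box (RED) 500.00
G5 80 TDD - Supp - Leaf Green (GREEN) 200.00
Bulb 1 ZTP - Supp - Fire Red (RED) 100.00
Bundle KB (25 pcs) 1 ZTP - Supp - Fire Red (RED) 100.00
Bundle PKCTL (18 pcs) 2 ZTP - Supp - Fire Red (RED) 100.00
BD 1 ZTP - Supp - Fire Red (RED) 100.00

Carrier: GST Limited
"""

# Group 1 is the ID, group 2 is the number that follows it.
# "Bundle " and "(...)" are matched but not captured, so they drop out.
row_pattern = re.compile(
    r"^(?:Bundle\s+)?(\w+)(?:\s+\([^)]*\))?\s+(\d+)\b", flags=re.MULTILINE
)

wanted = [f"{m.group(1)} {m.group(2)}" for m in row_pattern.finditer(sample_text)]

for line in wanted:
    print(line)
```

In .NET the captured parts are available through Groups(1).Value and Groups(2).Value on each Match, so the same idea should carry over to an Assign activity, though that has not been tested in Studio here.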