RegEx: Extract keywords from document

Hi,

I scraped some data and now have to extract some keywords from a document. Each of my keywords is in between two minus signs.

Below is an example of it:

The yellow parts are what I need to extract. Can someone please help using RegEx? I have provided the texts below. Thanks!


**Text1:**

WHAT TEAM?

The GSLC Collegiate Oversized T-Shirt is all about those chilled varsity vibes, bringing comfort, style and a big dose of team spirit to your rest day. Join team GSLC, rep your favourite colour, and relax.

- Oversized fit- Crew neck- Short Sleeves- Straight hem- Ombre GSLC logo with puff outline- 95% Cotton, 5% Elastane- We've cut down our use of swing tags, so this product comes without one- Model is 6'0" and wears a size M- SKU: GMST5775-BGR- Made at JM Fabrics Ltd in Bangladesh

**Text2:**

BRING YOUR BEST ⁃ Regular fit ⁃ Lightweight, stretchy material ⁃ Sweat-wicking DRY technology ⁃ Short set-in sleeves ⁃ Straight hem and crew neck ⁃ 51% Nylon, 49% Polyester- Model is 5'11" and wears size S

⁃ Label Colour: White

Bring your best, every single time, in the Vital T-Shirt. A performance-driven design, made to move in any direction - with a lightweight material and sweat-wicking properties - this workout top is one that strives to make your best performance the norm.

**Text3:**

REDEFINING YOUR POTENTIAL

The Arrival T-Shirt is your admission to a new, purposeful approach to training. Created to encourage you to aspire more, perform more and achieve more, this gym top boasts fundamental performance technology for next-level results, with a lightweight polyester-elastane material that enables maximum movement and exertion.

- Slim fit- Lightweight material- Physique-accentuating flatlock seam lines- Straight hem- Set-in sleeves and crew neck- Printed Gymshark logo to chest- 88% Polyester, 12% Elastane- We've cut down on our use of swing tags, so this product comes without one- Model is 5'10" and wears a size S- SKU: GMST5638-WH- Made at RSI Global Limited in Bangladesh

**Text4:**

GET THE JOB DONE ⁃ Regular fit ⁃ Set-in sleeves ⁃ Crew neck ⁃ Straight hem ⁃ Heat-sealed Gymshark logo to chest ⁃ 95% Cotton, 5% Elastane ⁃ Model is 5'10” and wears size S ⁃ Label Colour: BlackA t-hirt for every workout. The Critical Regular Fit T-Shirt will get the job done, every time you need it to, with a durable, stretchy cotton-elastane material and spacious fit. An essential for your next session.

Hi @Yudhisteer_Chintaram1

try out this regex
(-(.)*(?=-)|(?<=⁃ )(.)*(?=⁃))
Thanks,
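
A quick way to check what this pattern actually matches is to run it on a short excerpt. The sketch below uses Python's `re` module purely for illustration (the thread does not say which regex engine is in use, so behaviour there may differ), and `sample` is a shortened excerpt of Text1, not the full text. Because `(.)*` is greedy, the first alternative runs from the first dash to just before the last dash, so the whole list comes back as one match.

```python
import re

# Pattern copied from the post above. Python's `re` is an assumption made
# for illustration; the original tool may use a different engine.
PATTERN = re.compile(r"(-(.)*(?=-)|(?<=⁃ )(.)*(?=⁃))")

# Shortened excerpt of Text1 (some list items left out for brevity).
sample = (
    "- Oversized fit- Crew neck- Short Sleeves- Straight hem"
    "- SKU: GMST5775-BGR- Made at JM Fabrics Ltd in Bangladesh"
)

# group(0) is the full match; findall() would return tuples of the groups.
matches = [m.group(0) for m in PATTERN.finditer(sample)]

for text in matches:
    print(repr(text))
```

On this excerpt the greedy quantifier yields a single long match rather than one match per keyword, which is why a later split on the dash is needed.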


Thanks! But it is scraping more than what I need:

HI @Yudhisteer_Chintaram1

Can you share your expected output if possible?

(Whether you need that complete sentence or separate words)
Regards
Sudharsan

@Yudhisteer_Chintaram1

  1. use below regex
    (fit-(.)(?=-)|(?<=⁃ )(.)(?=⁃))

  2. then in for each loop of regular expression output
    split the string with "-"

  3. store the required results as per your need.

for testing you could use https://regex101.com/

let me know if you need any further help.

thanks @Robinnavinraj_S

Hi @Sudharsan_Ka,

Only the highlighted words. Thanks!

Hi @rahatadi,

Your regular expression is not detecting the keywords?

Can you help please?

my bad. this should work
(fit-(.)*(?=-)|(?<=⁃ )(.)*(?=⁃))

check this one also


Hi @rahatadi ,

This one works! But I could not understand your loop workflow. Can you help please?

  1. use below regex
    (fit-(.) (?=-)|(?<=⁃ )(.) (?=⁃))
  2. then in for each loop of regular expression output
    split the string with "-"
  3. store the required results as per your need.
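
The three steps above can be sketched as follows. This is a minimal illustration in Python (an assumption, since the thread does not name the tool or engine), using the pattern with the `*` quantifiers as posted in the "this should work" reply. `sample` is a shortened excerpt of Text1 and `keywords` is a hypothetical variable name.

```python
import re

# Step 1: the regex with the `*` quantifiers in place.
PATTERN = re.compile(r"(fit-(.)*(?=-)|(?<=⁃ )(.)*(?=⁃))")

# Shortened excerpt of Text1 (some list items left out for brevity).
sample = (
    "- Oversized fit- Crew neck- Short Sleeves- Straight hem"
    "- SKU: GMST5775-BGR- Made at JM Fabrics Ltd in Bangladesh"
)

keywords = []
for match in PATTERN.finditer(sample):
    # Step 2: split each match on the plain dash.
    for piece in match.group(0).split("-"):
        piece = piece.strip()
        # Step 3: store the non-empty pieces.
        if piece:
            keywords.append(piece)

print(keywords)
```

Two things to note on this excerpt: the match starts at `fit-`, so the first stored piece is only `fit` and the word before it is lost, and the dash inside the SKU is also treated as a separator.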

It is a simple way to extract the words between the dashes.

I thought you were receiving all the text at once, but later I found that those are separate texts.

You may not need that flow.


Can you share the text file

Hi @rahatadi ,

But here it misses the first word: "muscle"

Can you help please?

They are only independent texts:

Text1:

WHAT TEAM?
The GSLC Collegiate Oversized T-Shirt is all about those chilled varsity vibes, bringing comfort, style and a big dose of team spirit to your rest day. Join team GSLC, rep your favourite colour, and relax.

- Oversized fit- Crew neck- Short Sleeves- Straight hem- Ombre GSLC logo with puff outline- 95% Cotton, 5% Elastane- We've cut down our use of swing tags, so this product comes without one- Model is 6'0" and wears a size M- SKU: GMST5775-BGR- Made at JM Fabrics Ltd in Bangladesh

Text2:

BRING YOUR BEST ⁃ Regular fit ⁃ Lightweight, stretchy material ⁃ Sweat-wicking DRY technology ⁃ Short set-in sleeves ⁃ Straight hem and crew neck ⁃ 51% Nylon, 49% Polyester- Model is 5'11" and wears size S
⁃ Label Colour: White
Bring your best, every single time, in the Vital T-Shirt. A performance-driven design, made to move in any direction - with a lightweight material and sweat-wicking properties - this workout top is one that strives to make your best performance the norm.

Text3:

REDEFINING YOUR POTENTIAL

The Arrival T-Shirt is your admission to a new, purposeful approach to training. Created to encourage you to aspire more, perform more and achieve more, this gym top boasts fundamental performance technology for next-level results, with a lightweight polyester-elastane material that enables maximum movement and exertion.

- Slim fit- Lightweight material- Physique-accentuating flatlock seam lines- Straight hem- Set-in sleeves and crew neck- Printed Gymshark logo to chest- 88% Polyester, 12% Elastane- We've cut down on our use of swing tags, so this product comes without one- Model is 5'10" and wears a size S- SKU: GMST5638-WH- Made at RSI Global Limited in Bangladesh

Text4:

GET THE JOB DONE ⁃ Regular fit  ⁃ Set-in sleeves ⁃ Crew neck ⁃ Straight hem ⁃ Heat-sealed Gymshark logo to chest ⁃ 95% Cotton, 5% Elastane ⁃ Model is 5'10” and wears size S ⁃ Label Colour: BlackA t-hirt for every workout. The Critical Regular Fit T-Shirt will get the job done, every time you need it to, with a durable, stretchy cotton-elastane material and spacious fit. An essential for your next session.







Hi @Yudhisteer_Chintaram1

Try removing "fit" from the regex: (-(.)*(?=-)|(?<=⁃ )(.)*(?=⁃))
That way you will also get "Muscle", but it will return some other unwanted lines as well.

As suggested, split the result on "-" and loop through the values; if a value's length is more than 20,
skip it and go on to the next one.
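
A minimal sketch of that length filter, in Python for illustration only (the thread does not name the engine, so treat this as an assumption). It uses the pattern without `fit` and assumes the `*` quantifiers were intended. `MAX_LEN`, `TEXT1_LIST` and `kept` are hypothetical names, and the threshold of 20 is the value suggested above, not a tuned one.

```python
import re

# Pattern without "fit"; the `*` quantifiers are assumed to be intended.
PATTERN = re.compile(r"(-(.)*(?=-)|(?<=⁃ )(.)*(?=⁃))")
MAX_LEN = 20  # threshold suggested in the post above

# The dash-separated line from Text1.
TEXT1_LIST = (
    "- Oversized fit- Crew neck- Short Sleeves- Straight hem"
    "- Ombre GSLC logo with puff outline- 95% Cotton, 5% Elastane"
    "- We've cut down our use of swing tags, so this product comes without one"
    "- Model is 6'0\" and wears a size M- SKU: GMST5775-BGR"
    "- Made at JM Fabrics Ltd in Bangladesh"
)

kept = []
for match in PATTERN.finditer(TEXT1_LIST):
    for value in match.group(0).split("-"):
        value = value.strip()
        # Skip empty pieces and anything longer than the threshold.
        if value and len(value) <= MAX_LEN:
            kept.append(value)

print(kept)
```

The filter drops the long sentences, but short fragments such as the two halves of the SKU still pass, so some extra cleanup may be needed depending on which items count as keywords.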

Hope this helps

Regards
Sudharsan