How to extract specific text from a long text paragraph

Dear Friends,

I am facing a challenge and would appreciate your help, as I am unable to tweak my project to work properly. I have a long list of text, read from a .txt file, that contains a repetitive pattern of announcements like the following:

Company ABC

Is selling the following scrap material

EXCAVATION MACHINE

, interested buyers can contact

17369999 or ABDULLA@CompanyABC.COM

Company GGG

Is selling the following scrap material

Coaster BUS

, interested buyers can contact

17123123 or TTT@CompanyGGG.COM

Company DDD W.L.L

And partners

Is selling the following scrap material

Boats Motors

, interested buyers can contact

175654556 or Jeff@CompanyDDD.COM

and the list goes on with hundreds of records…

I need to capture each of the following fields in a separate data record of my database:

Company Name, Scrap Material, Telephone, Email

What I tried was a loop that goes through every 5 lines, combines them into one line of text using the CombinText function, and then uses Regex to extract the fields and save them to the database. The problem is what happens when a record runs to more than 5 lines, say 6 or 7, because the fixed-size loop no longer lines up with the records. (Please see the last record, which contains 6 lines.)

Do you suggest a better approach? Or is there a way to tweak the loop to handle 6 or more lines? (It might help to note that the last line of a record will always contain an email address; it might end with .com, .net, .in, etc.)
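One way to avoid the fixed 5-line loop is to use the hint in the question: close a record whenever a line contains an email address. Below is a minimal Python sketch of that idea, not the workflow from the attachments. The function name `split_records` and the simple email pattern are my own assumptions for illustration.

```python
import re

# Assumption: a record ends at the first line that contains an email address.
# This is a deliberately loose email pattern, not a full validator.
EMAIL = re.compile(r"\S+@\S+\.\w+")

def split_records(text):
    """Group non-empty lines into records, closing a record at each email line."""
    records, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines between entries
        current.append(line)
        if EMAIL.search(line):
            records.append(current)  # email line reached: record is complete
            current = []
    return records

sample = """Company GGG
Is selling the following scrap material
Coaster BUS
, interested buyers can contact
17123123 or TTT@CompanyGGG.COM
Company DDD W.L.L
And partners
Is selling the following scrap material
Boats Motors
, interested buyers can contact
175654556 or Jeff@CompanyDDD.COM"""

records = split_records(sample)
for rec in records:
    print(len(rec), "lines:", rec[0])
```

With the two sample entries this yields one 5-line record and one 6-line record, so the extra "And partners" line no longer shifts the following records.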

I would appreciate your help.

Thanks

Hi @Anwar_Mirza

Check the below workflow:
Regex Task.zip (255.1 KB)

Hope it helps!!

@Anwar_Mirza Just in case you don’t want to go the RegEx way, here is another way
Sample.zip (3.6 KB)

Thanks, this is a good approach, but please note that the line does not necessarily begin with the word “Company”; it can be “Factory”, “Establishment”, etc. Any suggestion for amending the pattern, please?

Thanks for the effort. I will go through it to understand how it works, as some of the activities are new to me. It seems to be working fine, but I will try it on real data and report back. Thanks, bro.

Hi @Anwar_Mirza

Give me some time and I will share the updated flow.

Regards

Hello @Anwar_Mirza

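Try this Regex pattern:
(?<=(Company|Factory|Establishment)\s)([\s\S]+?)[\n\r]+Is selling the.+[\n\r]+(.+)[\n\r]+.+[\n\r]+(\d+).+?(\S+@\S+)

This will capture each record and then group each component separately: group 1 is the keyword (Company, Factory or Establishment), group 2 the rest of the entity name, group 3 the scrap material, group 4 the telephone number and group 5 the email address. Note the lazy `.+?` before the email group; with a greedy `.+` the last group would keep only the final character before the `@`.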
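To make the grouping concrete, here is a rough Python adaptation of the pattern above, offered as a sketch only. Two things differ from the posted pattern and are my own assumptions: Python's `re` module does not accept this variable-width lookbehind, so the keyword is matched as an ordinary group, and a lazy `.+?` is used before the email group so that the whole address is captured. The group names (`prefix`, `name`, and so on) are hypothetical labels.

```python
import re

# Keyword matched as a normal named group, because Python's `re`
# requires lookbehinds to have a fixed width.
PATTERN = re.compile(
    r"(?P<prefix>Company|Factory|Establishment)\s"
    r"(?P<name>[\s\S]+?)[\n\r]+Is selling the.+[\n\r]+"
    r"(?P<material>.+)[\n\r]+.+[\n\r]+"
    r"(?P<phone>\d+).+?(?P<email>\S+@\S+)"
)

text = """Company ABC
Is selling the following scrap material
EXCAVATION MACHINE
, interested buyers can contact
17369999 or ABDULLA@CompanyABC.COM
Company DDD W.L.L
And partners
Is selling the following scrap material
Boats Motors
, interested buyers can contact
175654556 or Jeff@CompanyDDD.COM"""

# One dict of named groups per announcement
matches = [m.groupdict() for m in PATTERN.finditer(text)]
for fields in matches:
    print(fields)
```

The `name` group of the second record spans two lines ("DDD W.L.L" and "And partners"), which is what the lazy `[\s\S]+?` is there for; the line break can be collapsed afterwards if a single-line name is needed.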

Cheers

Steve

Thanks Pravathy. I’d appreciate it if it could capture a generic entity name; I mean it does not have to start with company/establishment/factory, etc. Maybe the Regex for capturing the entity name can be “all the text BEFORE the words ‘has a scrap’…”. Do you think this is doable?

Thank you for the efforts

Thank you, dear Steve. Please see my message to Pravathy below. The Regex pattern will still be limited in my case if new types of entities are added to the text file; I need it to capture any entity name before the words “has a scrap…”

Thank you for the kind help.
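For what it's worth, capturing "all the text before the marker words" looks doable. The Python sketch below treats every line up to the marker phrase as the entity name, so no list of keywords is needed. It is only an illustration under assumptions: the sample text in this thread uses "Is selling the following scrap material" while the later posts mention "has a scrap", so the marker is a parameter; the function name `extract` and the third sample record are made up.

```python
import re

def extract(text, marker="Is selling the"):
    """Return one dict per announcement; the name is everything before `marker`."""
    m = re.escape(marker)
    pattern = re.compile(
        # one or more whole lines that do not start with the marker
        rf"(?P<name>(?:^(?!{m}).+[\r\n]+)+?)"
        rf"{m}.*[\r\n]+"                 # the marker line itself
        r"(?P<material>.+)[\r\n]+"       # scrap material line
        r".+[\r\n]+"                     # ", interested buyers can contact"
        r"(?P<phone>\d+).+?(?P<email>\S+@\S+)",
        re.MULTILINE,
    )
    rows = []
    for match in pattern.finditer(text):
        rows.append({
            "name": " ".join(match["name"].split()),  # collapse line breaks
            "material": match["material"].strip(),
            "phone": match["phone"],
            "email": match["email"],
        })
    return rows

sample = """Company ABC
Is selling the following scrap material
EXCAVATION MACHINE
, interested buyers can contact
17369999 or ABDULLA@CompanyABC.COM
Company DDD W.L.L
And partners
Is selling the following scrap material
Boats Motors
, interested buyers can contact
175654556 or Jeff@CompanyDDD.COM
Establishment QRS
Is selling the following scrap material
Steel pipes
, interested buyers can contact
17000000 or sales@example.com"""
# The last record above is invented to show a name that is not "Company ...".

rows = extract(sample)
for row in rows:
    print(row)
```

If the real files say "has a scrap ..." instead, calling `extract(text, marker="has a scrap")` should be the only change needed, provided the remaining lines keep the same order.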