Looking for assistance with regex

Hi Brilliant Team,
I need your assistance with a regex. I have tried, but the text is not extracted properly. Could someone please help me?
With the partition of India in 1947, it became the Pakistani province of East Bengal (later renamed East Pakistan), one of five provinces of Pakistan, separated from the other four by 1,100 miles (1,800 km) of Indian territory. In 1971 it became the independent country of India, with its capital at.

look With the partition of India in 1947, it became the Pakistani province of East Bengal (later renamed East Pakistan), one of five provinces of Pakistan, separated from the other four by 1,100 miles (1,800 km) of Indian territory. In 1971 it became the independent country of india, with its capital at.

With the partition of India in 1947, it became the Pakistani province of East Bengal (later renamed East Pakistan), one of five provinces of Pakistan, separated from the other four by 1,100 miles (1,800 km) of Indian territory. In 1971 it became the independent country of India, with its capital at.

we are reading the news and after data massaging we will published in portal. The data is very confidential. We need to check it one by one.

we are reading the news and after data massaging we will published in portal. The data is very confidential. We need to check it one by one.

The output should be a data table like the one below:

![image|690x453](upload://3M2wYbsCy

I have tried with regex, but it is not working properly.
I am looking for your help.
Thanks

Hi @lisa_R ,

Could you try the suggestion below:

  1. As the whole text contains multiple sections to capture, we can use the Split method to separate each section, and then use the Title and Description regexes to capture the data.
System.Text.RegularExpressions.Regex.Split(strInput1,"(?=Title:)")

Here, strInput1 is a string variable containing the whole text. The output of the above expression is an array of strings, which we can loop through with a For Each activity.

  2. Use the expression above as the input of the For Each activity. We then capture the Title and Description for each split section.

Expressions to capture Title and Description:

title = System.Text.RegularExpressions.Regex.Match(currentItem,"(?<=Title:).*").Value.ToString.Trim
Description = System.Text.RegularExpressions.Regex.Match(currentItem,"(?<=Description:)[\s\S]+").Value.ToString.Trim

  3. Next, we add this data to the DataTable using the Add Data Row activity.

  4. At the end, outside the For Each loop, we can use the Write Range activity to write the DataTable to an Excel sheet.

Note: The NormalDT DataTable was built at the start using the Build Data Table activity, with the columns Title and Description.
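
The steps above can be sketched outside UiPath as well. The following is a minimal Python illustration of the same split-then-capture idea, not the workflow itself. The sample text and the function names `split_sections` and `extract_rows` are hypothetical, and the input layout with "Title:" and "Description:" labels is an assumption based on the regexes above. It needs Python 3.7 or later, where `re.split` accepts a zero-width pattern.

```python
import re

# Hypothetical sample input; the real text layout is assumed, not taken from the thread.
sample_text = (
    "Title: First item\n"
    "Description:\n"
    "Paragraph one.\n"
    "Paragraph two.\n"
    "Title: Second item\n"
    "Description:\n"
    "Only paragraph.\n"
)


def split_sections(text):
    # Zero-width split in front of every "Title:", similar to
    # Regex.Split(text, "(?=Title:)"). Empty pieces are dropped.
    return [s for s in re.split(r"(?=Title:)", text) if s.strip()]


def extract_rows(text):
    rows = []
    for section in split_sections(text):
        # Title: rest of the line after the label.
        title = re.search(r"Title:(.*)", section)
        # Description: everything after the label, up to the end of this section.
        description = re.search(r"Description:([\s\S]+)", section)
        rows.append({
            "Title": title.group(1).strip() if title else "",
            "Description": description.group(1).strip() if description else "",
        })
    return rows


rows = extract_rows(sample_text)
```

Each dictionary in `rows` plays the role of one data row with the Title and Description columns. A capture group is used here instead of the lookbehind from the VB.NET expressions; both return the text after the label.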

Hi @supermanPunch, thanks for your quick reply. The Description part is not working, which is where I got stuck too. It should extract all the paragraphs under Description, but it extracted all of the text below Description, including the next Title.

@lisa_R ,

Were the suggested steps followed?

First, we split the text into sections (from each Title up to the character before the next Title word). The Regex Split does this and gives us an array of split sections. Then we apply the regexes mentioned above to each split item, so that the Description match does not run on into the next section.

I do get the output in the following way:
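
To illustrate why the split step matters, here is a small Python sketch (not the poster's actual output). The two-section input is hypothetical and assumes "Title:" and "Description:" labels. It compares applying the Description pattern to the whole text with applying it to each split section.

```python
import re

# Hypothetical two-section input, assuming "Title:" / "Description:" labels.
text = (
    "Title: A\nDescription:\nText for A.\n"
    "Title: B\nDescription:\nText for B.\n"
)

pattern = r"Description:([\s\S]+)"

# Applied to the whole text, the greedy [\s\S]+ runs to the end of the input,
# so the first description swallows the next title as well.
whole = re.search(pattern, text).group(1).strip()

# Applied per section, each match stops at the end of its own section.
sections = [s for s in re.split(r"(?=Title:)", text) if s.strip()]
per_section = [re.search(pattern, s).group(1).strip() for s in sections]
```

Here `whole` still contains "Title: B", while `per_section` holds one clean description per section.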

Hi,

Another approach:

Can you try the following sample?

dt = System.Text.RegularExpressions.Regex.Matches(yourString,"Title:[\s\S]+?(?=Title:|$)").Cast(Of System.Text.RegularExpressions.Match).SelectMany(Function(m) System.Text.RegularExpressions.Regex.Matches(System.Text.RegularExpressions.Regex.Match(m.Value,"(?<=Description:\s*)[\s\S]+").Value,"[^\r\n]+").Cast(Of System.Text.RegularExpressions.Match).Select(Function(m2) dt.LoadDataRow({m.Value.Split(chr(10)).First,m2.Value},False))).CopyToDataTable

Sample20230310-6L.zip (3.3 KB)

Regards,

Hi @Yoichi, thanks a lot. It's working perfectly. Could you please explain how it works, so that I can modify it to extract the expected output?
Could you please suggest a tutorial where I can learn this functionality?

Hi,

To improve readability, add “System.Text.RegularExpressions” in the Imports tab in advance. I also added line breaks, as in the following expression.

Regex.Matches(yourString,"Title:[\s\S]+?(?=Title:|$)").Cast(Of Match) _
    .SelectMany(Function(m) _
        Regex.Matches(Regex.Match(m.Value,"(?<=Description:\s*)[\s\S]+").Value,"[^\r\n]+").Cast(Of Match) _
            .Select(Function(m2) _
                dt.LoadDataRow({m.Value.Split(chr(10)).First,m2.Value},False) _
            ) _
    ).CopyToDataTable

The first Regex.Matches call extracts each block of text that starts with “Title:”. Each block is assigned to m on the 2nd line.
Next, the regexes on the 3rd line extract the text after “Description:” and split it into individual lines. Each line is assigned to m2 on the 4th line.
Then, the dt.LoadDataRow method returns a DataRow that holds the first line of m and the value of m2.
Finally, these DataRows are converted to a DataTable by CopyToDataTable.
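
For readers who find the chained expression hard to follow, the same flow can be sketched as explicit loops. This is a Python illustration under assumptions, not a replacement for the VB.NET expression: the function name `rows_per_description_line` is hypothetical, tuples stand in for DataRows, a capture group replaces the variable-length lookbehind (which Python's `re` module does not support), and `\Z` is used for the end of the input.

```python
import re


def rows_per_description_line(text):
    rows = []
    # Outer match (m): each block runs lazily from "Title:" up to the next
    # "Title:" or the end of the input.
    for block in re.finditer(r"Title:[\s\S]+?(?=Title:|\Z)", text):
        # First line of the block, including the "Title:" label,
        # like m.Value.Split(chr(10)).First in the expression above.
        first_line = block.group(0).split("\n")[0].rstrip("\r")
        # Everything after "Description:" and any whitespace that follows it.
        body = re.search(r"Description:\s*([\s\S]+)", block.group(0))
        if body is None:
            continue
        # Inner match (m2): one row per non-empty line of the description.
        for line in re.findall(r"[^\r\n]+", body.group(1)):
            rows.append((first_line, line))
    return rows


# Hypothetical input, assuming "Title:" / "Description:" labels.
demo = "Title: A\nDescription:\nline 1\nline 2\nTitle: B\nDescription: only\n"
result = rows_per_description_line(demo)
```

Note that this approach produces one row per description line, with the title repeated on each row, whereas the split-based approach yields one row per section.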

Could you please suggest a tutorial where I can learn this functionality?

The above uses Regex and LINQ. If you are not very familiar with these features, it may be good to start by checking the following documents.

Regards,

Thanks @Yoichi, it's a great help.

@supermanPunch, thanks to you too.
You are both geniuses.
