Regex advice

Short · October 12, 2022, 1:26pm

Hi all

I have a problem I’m trying to solve and wondered if anyone could help. I’m trying to read through a PDF document, look for a keyword, and extract the whole sentence containing that word. For example, the PDF will look like this:

56.8. Liaise with the incumbent Service Provider to enable the full completion of the mobilisation period;
56.9. Produce and implement a communications plan , to be agreed with the Client, including the frequency, responsibility for and nature of communication with the Client and end users of the service;
56.10. Produce a mobilisation report for each Affected Property to encompass programmes that will fulfil all the Client’s obligations to landlords and other tenants. The format of reports and programmes shall be in accordance with the Client’s requirements. Particular attention shall be paid to establishing the operating requirements of the occupiers in drawing up these programmes for agreement with the Client;

If I were looking for the word “Communication”, I’d want to extract this part: 56.9. Produce and implement a communications plan , to be agreed with the Client, including the frequency, responsibility for and nature of communication with the Client and end users of the service;

At the moment, my process searches the PDF page by page, splits the text on Environment.NewLine into an array, and then looks through each element for “Communication”. Because the text is split on new lines, a sentence that wraps over several lines isn’t picked up in full.

I wondered if you knew how to split the text from number to number, e.g. the text from 56.8 to 56.9 (including the number at the start), the text from 56.9 to 56.10, etc. The only issue is that the format of the numbers varies, e.g. examples of numbers in the PDF are:
1.2.
1.2.3.
34.2.
34.3.16.
117.14.3.
216.2.13.3.

Any help would be greatly appreciated.

Yoichi · October 12, 2022, 1:55pm

Hi,

I wondered if you knew how to split the text from number to number, e.g. the text from 56.8 to 56.9 (including the number at the start), the text from 56.9 to 56.10, etc.

How about the following expression?

arrString = System.Text.RegularExpressions.Regex.Matches(yourString,"(\d+\.)+[\s\S]+?(?=(\d+\.)+|$)").Cast(Of System.Text.RegularExpressions.Match).Select(Function(m) m.Value).ToArray

Note: arrString is a String array (String[]).
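
For anyone who wants to try the pattern outside UiPath, here is a minimal sketch in Python rather than VB.NET. The pattern is unchanged; the sample text and the name `extract_clauses` are made up for illustration. One caveat: the lookahead stops at any digits followed by a dot, so a decimal such as 3.5 inside a sentence would cut the clause short.

```python
import re

# Same pattern as in the expression above: a clause number such as "56.9."
# followed lazily by any text, up to the next number or the end of the input.
CLAUSE_PATTERN = re.compile(r"(\d+\.)+[\s\S]+?(?=(\d+\.)+|$)")

# Hypothetical sample text, shortened from the clauses in the first post.
SAMPLE_TEXT = (
    "56.8. Liaise with the incumbent Service Provider;\n"
    "56.9. Produce and implement a communications plan,\n"
    "to be agreed with the Client;\n"
    "56.10. Produce a mobilisation report;"
)

def extract_clauses(text):
    """Return each numbered clause as one string, even if it spans lines."""
    # finditer + group(0) gives the whole match, ignoring the capture groups.
    return [m.group(0).strip() for m in CLAUSE_PATTERN.finditer(text)]

clauses = extract_clauses(SAMPLE_TEXT)
for clause in clauses:
    print(repr(clause))
```

Each element of `clauses` starts with its own number, so the clause can then be searched for the keyword as a whole.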

Regards,

ptrobot · October 12, 2022, 1:56pm

You could use Regex.Split() with the pattern (?=^\d+\.)
It needs the RegexOptions.Multiline option so that ^ matches at the start of every line rather than only at the start of the text.
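
As a rough illustration of the split approach, here is a minimal Python sketch (not UiPath/VB.NET) that splits in front of each line starting with a clause number and then keeps only the clauses containing a keyword. The sample text and the function names are hypothetical. Because the pattern is anchored with ^, a number in the middle of a line does not trigger a split, but a wrapped line that happens to begin with digits and a dot still would.

```python
import re

# Hypothetical sample text, shortened from the clauses in the first post.
SAMPLE_TEXT = (
    "56.8. Liaise with the incumbent Service Provider;\n"
    "56.9. Produce and implement a communications plan,\n"
    "including the nature of communication with the Client;\n"
    "56.10. Produce a mobilisation report for each Affected Property;"
)

def split_on_clause_numbers(text):
    """Split text in front of every line that starts with digits and a dot."""
    # re.MULTILINE lets ^ match at the start of each line, not only at the
    # start of the whole text. Splitting on an empty match needs Python 3.7+.
    parts = re.split(r"(?=^\d+\.)", text, flags=re.MULTILINE)
    # The split at position 0 leaves an empty first element, so drop blanks.
    return [part.strip() for part in parts if part.strip()]

def clauses_containing(text, keyword):
    """Return the clauses that contain keyword, ignoring case."""
    needle = keyword.lower()
    return [c for c in split_on_clause_numbers(text) if needle in c.lower()]

matching = clauses_containing(SAMPLE_TEXT, "Communication")
print(matching)
```

The case-insensitive comparison is there because the search word in the question is “Communication” while the clause text uses lowercase “communication”.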

Short · October 12, 2022, 2:12pm

You absolute genius, thank you!

Do you know why it doesn’t work on regex101? I’ve tested it in UiPath and it works perfectly, but I wondered why it doesn’t work there.

Yoichi · October 12, 2022, 2:18pm

Hi,

When testing a pattern for UiPath on regex101, set the Flavor to .NET (C#).
In this case you also need to enable the global option, so that every match is returned rather than only the first one.
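
To show what the global option changes, here is a small Python sketch (an analogy only, not regex101 or UiPath itself): `search` stops after the first match, much like testing with global switched off, while `finditer` walks through every match, which is what Regex.Matches does in .NET. The sample text is made up for illustration.

```python
import re

pattern = re.compile(r"(\d+\.)+[\s\S]+?(?=(\d+\.)+|$)")

# Hypothetical text using some of the number formats from the first post.
text = "1.2. First clause;\n1.2.3. Second clause;\n34.2. Third clause;"

# First match only, comparable to testing without the global option.
first_only = pattern.search(text).group(0).strip()

# Every match, comparable to testing with the global option enabled.
all_matches = [m.group(0).strip() for m in pattern.finditer(text)]

print(first_only)
print(all_matches)
```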

Regards,

Short · October 12, 2022, 2:26pm

You’re amazing, thank you so much!

