How to get text from a PDF surrounding a specific keyword

In my process I need to get specific text surrounding a keyword. In the example below, the keyword is ‘process’ and I need to get the text from the first number up to the full stop after the word services. I have managed to get the text between the full stop after the number 4 and the full stop after the word services, but I also need the numbers before the number 4. Any help is very much appreciated, thanks.

107.2.4. Be required to manage and agree all consents of Others as part of this process (for example landlords) before commencing works or services. All permits to work shall be supported by full risk assessments and method statements for undertaking the work.

Hi @raychel.hall ,

Kindly share an example input and the expected output.

Thanks,
RajKumar

Hello @raychel.hall ,

One simple approach using basic UiPath activities would be:

The output being:

I’m assuming that you got the text between the full stop after the number 4 and the full stop after the word services by splitting the initial text by ". "

You could amend the condition item.Substring(0,1).IsNumeric to fit your needs. In this case it checks whether the first character is numeric, assuming that a normal sentence will not start with a digit…
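For anyone reading without the workflow to hand, here is a rough sketch of the same idea in Python rather than UiPath activities. It assumes the text is split by ". ", that a piece starting with a digit is the paragraph numbering, and that the numbering should be glued back onto the sentence that follows it. The function name and variable names are made up for illustration.

```python
def sentences_with_numbers(text):
    """Split on '. ' and re-attach a leading paragraph number to its sentence."""
    results = []
    prefix = ""
    for piece in text.split(". "):
        if not piece:
            continue
        if piece[:1].isdigit():
            # Assumption: a piece starting with a digit is numbering such as "107.2.4"
            prefix = piece + ". "
        else:
            # Restore the full stop that the split removed
            results.append(prefix + piece.rstrip(".") + ".")
            prefix = ""
    return results


text = (
    "107.2.4. Be required to manage and agree all consents of Others as part of "
    "this process (for example landlords) before commencing works or services. "
    "All permits to work shall be supported by full risk assessments and method "
    "statements for undertaking the work."
)

# Keep only the sentences that contain the keyword
keyword_hits = [s for s in sentences_with_numbers(text) if "process" in s.lower()]
print(keyword_hits)
```

As noted above, a sentence that really does start with a digit would be mistaken for numbering, so the check may need tightening for real documents.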

Hope it helps!
Best regards,
Marius

Hey @raychel.hall

Regex should help here…

I have created an expression based on the example you have provided.

([\d.]{2,}\s[a-zA-Z\s\d()]*\.)

The regex will have to be updated based on your actual requirement though. You can try it out on the Regex Storm website (.NET Regex Tester). More examples are needed to make the regex more robust…

Step 1: Import all libraries.
Step 2: Convert PDF file to text format and read data.
Step 3: Use the “findall()” function of the regular expressions library to extract the text around the keyword.
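A minimal Python sketch of these steps, assuming the PDF text has already been extracted into a string (the conversion step is left out because it depends on which PDF library is used). The pattern is a variant of the expression above with the final full stop escaped, and the variable names are only illustrative.

```python
import re

# Assumption: this string stands in for the text read from the PDF
pdf_text = (
    "107.2.4. Be required to manage and agree all consents of Others as part of "
    "this process (for example landlords) before commencing works or services. "
    "All permits to work shall be supported by full risk assessments and method "
    "statements for undertaking the work."
)

# Numbering, one whitespace character, then letters, digits, spaces and
# brackets up to and including the first full stop
CLAUSE = re.compile(r"([\d.]{2,}\s[a-zA-Z\s\d()]*\.)")

clauses = CLAUSE.findall(pdf_text)

# Keep only the numbered sentences that mention the keyword
matches = [c for c in clauses if "process" in c.lower()]
print(matches)
```

The character class does not allow commas, hyphens or quotes, so a sentence containing those would not match without widening it.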

Regards,
Will

Thanks Marius.

The first step of the process is to locate a keyword; in the previous example it is ‘process’.
Next I need to get the full sentence that the keyword is part of; in the previous example it is ‘Be required to manage and agree all consents of Others as part of this process (for example landlords) before commencing works or services’.
Then I need to include the paragraph number; in the previous example it is ‘107.2.4.’

So out of the paragraph…
‘107.2.4. Be required to manage and agree all consents of Others as part of this process (for example landlords) before commencing works or services. All permits to work shall be supported by full risk assessments and method statements for undertaking the work.’

As the key word is process, I need to pull out the part that says…
‘107.2.4. Be required to manage and agree all consents of Others as part of this PROCESS (for example landlords) before commencing works or services.’

In addition, the numbers may not always be in the above format. Below is another example:
‘143.4. The services delivered outside of the agreed cleaning operational hours shall be managed via the Service Order process on instruction by the Service Manager.’
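Putting the three requirements together (find the keyword, take its full sentence, prepend the paragraph number), one possible sketch in Python is below. It assumes each paragraph starts on a new line with dot-separated numbering such as 107.2.4. or 143.4., and that sentences end with a full stop followed by whitespace. The function name numbered_sentences is hypothetical, and the same regular expressions should carry over to a .NET Regex with minor syntax changes.

```python
import re

# Dot-separated numbering of any depth, e.g. "143.4." or "107.2.4."
NUMBER = r"\d+(?:\.\d+)*\."

# A paragraph: numbering at the start of a line, then everything up to the
# next numbered line or the end of the text
PARAGRAPH = re.compile(
    rf"^({NUMBER})\s+(.*?)(?=^{NUMBER}\s|\Z)", re.MULTILINE | re.DOTALL
)


def numbered_sentences(text, keyword):
    """Return 'number sentence' for every sentence containing the keyword."""
    word = re.compile(rf"\b{re.escape(keyword)}\b", re.IGNORECASE)
    hits = []
    for number, body in PARAGRAPH.findall(text):
        # Assumption: a sentence ends at a full stop followed by whitespace
        for sentence in re.split(r"(?<=\.)\s+", body.strip()):
            if word.search(sentence):
                hits.append(f"{number} {sentence}")
    return hits


sample = (
    "107.2.4. Be required to manage and agree all consents of Others as part of "
    "this process (for example landlords) before commencing works or services. "
    "All permits to work shall be supported by full risk assessments and method "
    "statements for undertaking the work.\n"
    "143.4. The services delivered outside of the agreed cleaning operational "
    "hours shall be managed via the Service Order process on instruction by the "
    "Service Manager."
)

for hit in numbered_sentences(sample, "process"):
    print(hit)
```

Because the sentence is found first and the number is added afterwards, this also covers the case where the keyword sits in the second or third sentence of a paragraph. Abbreviations such as "e.g. " would split a sentence early, so the sentence rule may need adjusting for the real documents.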