Extracting sentence with set word in it

Short · September 28, 2022, 1:48pm

Hi all

In my process, I’m searching for a word in a PDF. I then want to extract the sentence that it appears in. Every sentence is different though and I have no idea how to extract the sentence, as I can’t, for example, tell it to extract the text between “A” and “B”.

e.g. if the word “information” is being searched, these are the types of sentences it would bring up:

This may include for example information regarding the location of buried utilities such as water or gas lines.
The contractor should use the Client specified list of formats for all information deliverables (unless otherwise stated).
Select data formats to be used for project information delivery should include; Graphical, Non-Graphical and Documentation.

Any ideas how this could be done please?

Thanks

raja.arslankhan · September 28, 2022, 1:52pm

Hi @Short
Let's try this:
1-First, split your text into sentences.
2-Then loop over the array of sentences.
3-Inside the loop, check each sentence for the word (information) using Contains.
4-If a sentence contains it, you can easily retrieve it by its array index.
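
The four steps above could look roughly like this in VB.NET. It is only a minimal sketch: FindSentences and the sample text are made up for illustration, and it assumes a sentence ends with ".", "!" or "?" followed by whitespace, so an abbreviation such as "e.g." would split a sentence early.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module SentenceSearch
    ' Hypothetical helper: returns every sentence in text that contains word, ignoring case.
    Function FindSentences(text As String, word As String) As String()
        ' Collapse line breaks and repeated spaces so a wrapped sentence becomes one line.
        Dim flat As String = Regex.Replace(text, "\s+", " ").Trim()
        ' Assumption: a sentence ends with ".", "!" or "?" followed by whitespace.
        Dim sentences As String() = Regex.Split(flat, "(?<=[.!?])\s+")
        Return sentences.Where(Function(s) s.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0).ToArray()
    End Function

    Sub Main()
        ' Made-up sample text with one sentence wrapped across two lines.
        Dim sample As String = "Scope is defined here. This may include information" &
            Environment.NewLine & "regarding buried utilities. Nothing else applies."
        For Each s As String In FindSentences(sample, "information")
            Console.WriteLine(s)
        Next
    End Sub
End Module
```

In a workflow, the two Regex calls and the Where filter should be usable as separate Assign expressions, with the PDF text in place of the sample.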

supermanPunch · September 28, 2022, 2:05pm

Hi @Short ,

We could try reading the PDF file with the Read PDF Text activity (usually with the PreserveFormat property enabled) and then splitting the text on new lines.

textArray = Split(pdfText,Environment.NewLine).ToArray

Here, textArray is a variable of type Array of String.

Next, we perform the word search and grab the entire sentence as below:

foundSentences = textArray.Where(Function(x) x.ToLower.Contains("yourWordToSearch".ToLower)).ToArray

Here, foundSentences is also an Array of String variable, which should contain your matching sentences. Both sides of the comparison are lower-cased so that the search ignores case.

Check whether this method meets your requirement. If it doesn’t work with PreserveFormat enabled, toggle the property and check the result again.
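
Two details of the line-based search above can matter in practice: Contains also matches inside longer words ("information" is found in "misinformation"), and splitting on Environment.NewLine leaves the text in one piece if the extracted text uses a different line ending. A possible variation is sketched below, under the assumption that whole-word matching is what is wanted; FindLines and the sample lines are hypothetical.

```vbnet
Imports System
Imports System.Linq
Imports System.Text.RegularExpressions

Module LineSearch
    ' Hypothetical helper: returns the lines of pdfText that contain word as a whole word.
    Function FindLines(pdfText As String, word As String) As String()
        ' "\r?\n" splits on both Windows (CRLF) and Unix (LF) line endings.
        Dim lines As String() = Regex.Split(pdfText, "\r?\n")
        ' \b marks a word boundary; Regex.Escape treats any special characters in word literally.
        Dim pattern As String = "\b" & Regex.Escape(word) & "\b"
        Return lines.Where(Function(x) Regex.IsMatch(x, pattern, RegexOptions.IgnoreCase)).ToArray()
    End Function

    Sub Main()
        ' Made-up sample: only the first line contains "information" as a whole word.
        Dim sample As String = String.Join(Environment.NewLine,
            "Information deliverables are listed here.",
            "Avoid misinformation in reports.",
            "No match on this line.")
        For Each hit As String In FindLines(sample, "information")
            Console.WriteLine(hit)
        Next
    End Sub
End Module
```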

Short · September 28, 2022, 3:24pm

Hi @supermanPunch

Amazing, thank you!! I am having one issue though, it’s finding this sentence:

But on the PDF it looks like this as it’s in a table:

Is there a way to get around this?

supermanPunch · September 28, 2022, 4:28pm

@Short ,

Have you tried toggling the PreserveFormat property and checking the results?

Sometimes the data is in a table, and we may need to rearrange the table into the format we require before doing the operation. That would also require knowing whether the data follows the same format in all the PDF files you receive.

Alternatively, we could help with regex expressions if we can see the sample data, or if it is certain that all the data follows a particular pattern.

Short · September 29, 2022, 8:26am

Hi @supermanPunch

PreserveFormat doesn’t change much, and the PDFs will be sent from different companies, meaning the format will never be the same, unfortunately. Apparently it’s rare for a PDF to contain tables like the one I tested, so I don’t think it’ll be an issue.

Thank you so much for all of your help

Short · September 29, 2022, 10:16am

Hi @supermanPunch

One last question, sorry! At the moment it’s picking up the line that “resource” is on. Is there a way to get the text from number to number, i.e. all the text from 41.2 to 41.3? The numbers will change each time.

supermanPunch · September 29, 2022, 11:59am

@Short ,

Yes, it should be possible. However, we would need to look at the format/pattern of the data first, so that we have relevant samples to test on.

The method which would be most suitable is regex, with which we can also work around other possibilities.

It would be better to open a new topic on this as many other regex experts could help you on this case.
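
As a rough illustration of the regex route, the sketch below returns the whole numbered clause that contains the search word. The thread never shows the real PDF text, so the layout is an assumption: every clause is taken to start a line with a number such as 41.2, and the clause runs until the next line that starts with such a number. FindClause and the sample text are made up, and a body line that happens to begin with a decimal number would cut the clause short.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module ClauseSearch
    ' Hypothetical helper: returns the numbered clause that contains word, or "" if none does.
    Function FindClause(text As String, word As String) As String
        ' Assumption: a clause starts a line with a number like 41.2.
        ' The lookahead ends the match just before the next clause number or at the end of the text.
        Dim pattern As String = "^\s*\d+\.\d+\b.*?(?=^\s*\d+\.\d+\b|\z)"
        ' Multiline lets ^ match at every line start; Singleline lets . match line breaks.
        Dim opts As RegexOptions = RegexOptions.Multiline Or RegexOptions.Singleline
        For Each m As Match In Regex.Matches(text, pattern, opts)
            If m.Value.IndexOf(word, StringComparison.OrdinalIgnoreCase) >= 0 Then
                Return m.Value.Trim()
            End If
        Next
        Return String.Empty
    End Function

    Sub Main()
        ' Made-up sample: clause 41.2 wraps across two lines and contains "resource".
        Dim sample As String = String.Join(Environment.NewLine,
            "41.1 General requirements apply.",
            "41.2 The contractor shall name a resource",
            "for each deliverable.",
            "41.3 Formats are listed below.")
        Console.WriteLine(FindClause(sample, "resource"))
    End Sub
End Module
```

Because the clause numbers are matched by pattern rather than hard-coded, the same expression should keep working when the numbers change from one PDF to the next, as long as the numbering style stays the same.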

