Extracting sentence with set word in it

Hi all

In my process, I’m searching for a word in a PDF. I then want to extract the sentence that it appears in. Every sentence is different though and I have no idea how to extract the sentence, as I can’t, for example, tell it to extract the text between “A” and “B”.

e.g. if the word “information” is being searched, these are the types of sentences it would bring up:

  • This may include for example information regarding the location of buried utilities such as water or gas lines.

  • The contractor should use the Client specified list of formats for all information deliverables (unless otherwise stated).

  • Select data formats to be used for project information delivery should include; Graphical, Non-Graphical and Documentation.

Any ideas how this could be done please?

Thanks :slight_smile:

Hi @Short
Let's try this:
1- First, split your text into sentences.
2- Then loop over the array of sentences.
3- Inside the loop, check whether the sentence contains the word (e.g. "information") using Contains.
4- If it does, you can get the sentence easily by its array index.
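
For anyone reading later, here is a rough sketch of those steps. It is written in Python purely to illustrate the logic, not as UiPath code, and the function name `sentences_containing` is made up for this example. It assumes a sentence ends with ".", "!" or "?" followed by whitespace, which will mis-split abbreviations such as "e.g.", so treat it as a starting point only.

```python
import re

def sentences_containing(text, word):
    """Return every sentence in text that contains word as a whole word."""
    # Collapse line breaks so sentences wrapped across PDF lines are rejoined.
    flat = re.sub(r"\s+", " ", text).strip()
    # Naive sentence split: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", flat)
    # Whole-word, case-insensitive match on the search term.
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)]

# Sample text built from the sentences in the question, plus one filler sentence.
sample = (
    "This may include for example information regarding the location\n"
    "of buried utilities such as water or gas lines. The site opens at nine. "
    "Select data formats to be used for project information delivery should include; "
    "Graphical, Non-Graphical and Documentation."
)

matches = sentences_containing(sample, "information")
for sentence in matches:
    print(sentence)
```

Joining the lines before splitting matters because text read from a PDF usually breaks a sentence wherever the line wraps on the page.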


Hi @Short ,

We could try reading the PDF file with the Read PDF Text activity (usually with the PreserveFormat property enabled) and then splitting the text on new lines.

textArray = Split(pdfText,Environment.NewLine).ToArray

Here, textArray is a variable of type Array of String.

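Next, we search for the word and grab every line (sentence) that contains it:

foundSentences = textArray.Where(Function(x) x.ToLower.Contains("yourwordtosearch")).ToArray

Here, foundSentences is also an Array of String variable, which should contain your matching sentences. Because each line is lower-cased before the comparison, the search word must be written in lower case too, otherwise nothing will match.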

Check whether this method meets your requirement. If it doesn't work with PreserveFormat enabled, toggle the property and check the result again.


Hi @supermanPunch

Amazing, thank you!! I am having one issue though, it’s finding this sentence:

image

But on the PDF it looks like this as it’s in a table:

Is there a way to get around this?

@Short ,

Have you tried toggling the PreserveFormat property and checking the results?

Sometimes the data is in a table, and we first need to rearrange the table into the format we require before doing the operation. But this would also require us to know whether the data is in the same format in all the PDF files you receive.

Alternatively, we could help with regex if we could see sample data, or if it is certain that all the data follows a particular pattern.


Hi @supermanPunch

PreserveFormat doesn't change much, and the PDFs will be sent from different companies, meaning the format will never be the same, unfortunately. Apparently it's rare for a PDF to contain tables like the one I tested, so I don't think it'll be an issue.

Thank you so much for all of your help :slight_smile:


Hi @supermanPunch

One last question, sorry! At the moment it's picking up the line that “resource” is on. Is there a way to get the text from number to number? i.e. get all the text from 41.2 to 41.3, though the numbers will change each time.

@Short ,

Yes, it should be possible. However, we would need to look at the format/pattern of the data, so that we have relevant samples to test on.

The most suitable method is regex, which also lets us handle other variations.

It would be better to open a new topic for this, as many other regex experts could help you with this case.
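
As a rough illustration of the regex idea, here is a sketch in Python (not UiPath code). It rests on an assumption that has not been checked against the real PDFs: each clause starts on its own line with a number such as "41.2". The sample text and the function name `clauses_containing` are invented for the example. A line that happens to start with a decimal like "3.5 metres" would be mistaken for a clause start, so the pattern would need tuning on real samples.

```python
import re

# Assumption: a clause number is digits, a dot, digits, at the start of a line.
CLAUSE_START = re.compile(r"^(\d+\.\d+)\s+", re.MULTILINE)

def clauses_containing(text, word):
    """Map clause number -> clause text for every clause that contains word."""
    starts = list(CLAUSE_START.finditer(text))
    word_re = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    found = {}
    for i, match in enumerate(starts):
        # A clause runs from just after its number to the next clause number.
        end = starts[i + 1].start() if i + 1 < len(starts) else len(text)
        # Rejoin wrapped lines into a single line of text.
        body = " ".join(text[match.end():end].split())
        if word_re.search(body):
            found[match.group(1)] = body
    return found

# Invented sample text, only to show the shape of the input.
sample = (
    "41.1 The contractor shall attend all meetings.\n"
    "41.2 The contractor shall provide a named resource\n"
    "for the duration of the project.\n"
    "41.3 Payment terms are set out in Schedule 2.\n"
)

result = clauses_containing(sample, "resource")
print(result)
```

Because the clause boundaries are found by pattern rather than by fixed values, the same code works when the numbers change from one document to the next.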
