In my process, I’m searching for a word in a PDF. I then want to extract the sentence that it appears in. Every sentence is different though and I have no idea how to extract the sentence, as I can’t, for example, tell it to extract the text between “A” and “B”.
E.g. if the word “information” is searched for, these are the types of sentences it would bring up:
This may include for example information regarding the location of buried utilities such as water or gas lines.
The contractor should use the Client specified list of formats for all information deliverables (unless otherwise stated).
Select data formats to be used for project information delivery should include; Graphical, Non-Graphical and Documentation.
Any ideas how this could be done please?
1. First, try to split your text into sentences.
2. Then loop over the array of sentences.
3. Inside the loop, check each sentence for the word (information) using Contains.
4. If it contains the word, you can easily get the sentence by its index in the array.
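The four steps above could be sketched roughly as follows. This is only an illustration: the sample text, the module name and the sentence-splitting pattern (split after ".", "!" or "?" followed by whitespace) are assumptions, and the pattern will mis-split abbreviations such as "e.g. " that end in a full stop.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module SentenceSearch
    Sub Main()
        ' Sample text standing in for the text read from the PDF (assumption).
        Dim pdfText As String =
            "This may include for example information regarding the location of buried utilities. " &
            "The contractor should follow the site rules. " &
            "Select data formats to be used for project information delivery."

        ' Step 1: split after ., ! or ? when followed by whitespace.
        Dim sentences As String() = Regex.Split(pdfText, "(?<=[.!?])\s+")

        ' Steps 2-4: loop over the array, test each sentence and report
        ' the index and text of every sentence that contains the word.
        For i As Integer = 0 To sentences.Length - 1
            If sentences(i).IndexOf("information", StringComparison.OrdinalIgnoreCase) >= 0 Then
                Console.WriteLine("{0}: {1}", i, sentences(i))
            End If
        Next
    End Sub
End Module
```

In a UiPath workflow the same two expressions (the `Regex.Split` call and the `IndexOf` check) would normally sit in Assign and If activities rather than in a module.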
Hi @Short ,
We could try reading the PDF file with the Read PDF Text activity (usually with the PreserveFormat property enabled), then split the text on new lines.
textArray = Split(pdfText,Environment.NewLine).ToArray
textArray is a variable of type Array of String.
Next, we perform the word search and grab the entire sentence like below :
foundSentences = textArray.Where(Function(x) x.ToLower.Contains("yourwordtosearch")).ToArray
foundSentences is also an Array of String type variable which should contain your matching sentences.
Check whether the above method meets your requirement. If it doesn’t work with the PreserveFormat property enabled, toggle it off and check the result again.
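As a self-contained sketch of the line-based approach, the example below splits on both CRLF and LF, in case the extracted text does not use the same line ending as Environment.NewLine, and compares without regard to case so the search word can be typed in any casing. The sample text and the variable `searchWord` are assumptions made for illustration.

```vbnet
Imports System
Imports System.Linq
Imports Microsoft.VisualBasic

Module LineSearch
    Sub Main()
        ' Sample text standing in for the Read PDF Text output (assumption);
        ' the line endings are deliberately mixed.
        Dim pdfText As String = "First line with Information." & vbCrLf &
                                "Second line without it." & vbLf &
                                "Third line, more information here."
        Dim searchWord As String = "information"

        ' Split on CRLF or LF and drop empty lines.
        Dim textArray As String() = pdfText.Split({vbCrLf, vbLf}, StringSplitOptions.RemoveEmptyEntries)

        ' Keep only the lines that contain the word, ignoring case.
        Dim foundSentences As String() = textArray.Where(Function(x) x.IndexOf(searchWord, StringComparison.OrdinalIgnoreCase) >= 0).ToArray()

        For Each sentence As String In foundSentences
            Console.WriteLine(sentence)
        Next
    End Sub
End Module
```

Note that this returns whole lines, not grammatical sentences, so a sentence that wraps across two lines in the PDF would come back only in part.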
Amazing, thank you!! I am having one issue though, it’s finding this sentence:
But on the PDF it looks like this as it’s in a table:
Is there a way to get around this?
Have you tried toggling the PreserveFormat property and checking the results?
Sometimes the data is in a table, and we need to rearrange the table text into the format we require before doing the search. That, however, requires knowing whether the data follows the same format in all the PDF files you will receive.
Alternatively, we could help with regex expressions if we can see some sample data, or if it is certain that all the data follows a particular pattern.
PreserveFormat doesn’t change much, and the PDFs will be sent by different companies, meaning the format will never be the same unfortunately. Apparently it’s rare for a PDF to contain tables like the one I tested, so I don’t think it’ll be an issue.
Thank you so much for all of your help
One last question, sorry! At the moment it’s picking up the line “resource” is on, is there a way to get the text from number to number? i.e. get all the text from 41.2 to 41.3, though the numbers will change each time
Yes, it should be possible. However, we would need to look at the format/pattern of the data first, so that we have relevant samples to test on.
The method which would be most suitable is regex, with which we can also work around other possibilities.
It would be better to open a new topic on this as many other regex experts could help you on this case.
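Purely as a starting point for that new topic, a regex along these lines might capture everything from one clause number up to the next. It assumes that every clause starts on its own line with a number of the form digits.digits followed by a space (e.g. "41.2 "), which may not hold for your PDFs; the sample text and names below are hypothetical.

```vbnet
Imports System
Imports System.Text.RegularExpressions
Imports Microsoft.VisualBasic

Module SectionSearch
    Sub Main()
        ' Hypothetical sample; real PDFs may number clauses differently.
        Dim pdfText As String = "41.1 General requirements apply." & vbLf &
                                "41.2 The contractor shall provide" & vbLf &
                                "a resource plan for the works." & vbLf &
                                "41.3 Reports are issued monthly."
        Dim searchWord As String = "resource"

        ' A clause starts at a line beginning with digits.digits and a space,
        ' and runs until the next such line or the end of the text.
        Dim pattern As String = "^\d+\.\d+\s.*?(?=^\d+\.\d+\s|\z)"
        Dim options As RegexOptions = RegexOptions.Multiline Or RegexOptions.Singleline

        ' Print every clause that contains the search word, ignoring case.
        For Each m As Match In Regex.Matches(pdfText, pattern, options)
            If m.Value.IndexOf(searchWord, StringComparison.OrdinalIgnoreCase) >= 0 Then
                Console.WriteLine(m.Value.Trim())
            End If
        Next
    End Sub
End Module
```

With the sample text above, this should print the whole 41.2 clause, including its second line, because the numbers are matched by pattern and not hard-coded.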